ABSTRACT

Previous work reported on SXSI, a fast XPath engine which executes
tree automata over compressed XML indexes. Here, we investigate the
reasons why SXSI is so fast. It is shown that tree automata can be
used as a general framework for fine-grained XML query optimization.
We define the "relevant nodes" of a query as those nodes that a
minimal automaton must touch in order to answer the query. This
notion allows us to skip many subtrees during execution and, with
the help of particular tree indexes, even to skip internal nodes of
the tree. We efficiently approximate runs over relevant nodes by
means of on-the-fly removal of alternation and non-determinism of
(alternating) tree automata. We also introduce many implementation
techniques which allow us to efficiently evaluate tree automata,
even in the absence of special indexes. Through extensive
experiments, we demonstrate the impact of the different optimization
techniques.

1. INTRODUCTION 

The XPath query language plays a central role in XML processing: it
is deeply rooted in almost every XML technology, from query
languages such as XQuery and XSLT, to access control languages such
as XACML, to the JavaScript engines of popular web browsers. Thus,
efficient XPath evaluation is essential for any time-critical XML
processing. In this paper we show how tree automata can be used as a
framework for fine-grained and novel types of XPath query
optimizations. The experiments with our prototype show that,
together with appropriate indexes for the XML document tree, these
optimizations give rise to unprecedented execution speed for XPath
queries, outperforming the fastest existing XPath engines.

The first breakthrough in efficient XPath execution was Koch et
al.'s seminal paper [6] (see also [7]) where it is shown that Core
XPath can be evaluated in time $O(|D| \cdot |Q|)$, where $|D|$ is the
size of the document and $|Q|$ is the size of the query. Core XPath
refers to the tree navigational fragment of XPath. Considering the
time bound of Koch's algorithm, there are two obvious ways of
reducing this complexity in practice:

(1) reduce the number of query steps ("|Q|-optimization") and

(2) reduce the number of nodes to consider ("|D|-optimization").



Extreme |Q|-Optimization: A top-down deterministic tree automaton
(TDTA) processes an input tree starting in its initial state, at the
root node. It then applies a unique rule which says, for a given
state and label of a node, how to process the children of that node.
A node is selected as a result if the unique state reached by the
automaton on that node and the label of that node form an element of
a special "set of selection pairs". After compiling a (restricted)
XPath query into such an automaton (which takes $O(|Q|)$ time), the
run function only requires a single look-up at each node of the
input tree, plus possibly an insertion of the current node into the
result list. Since the function visits the nodes in document order
and only once, this insertion can be performed in constant time and
keeps the result sorted and duplicate-free. Thus, the evaluation
runs in $O(|D|)$ time, which is the extreme case of
|Q|-optimization: $|Q| = 1$. Similar automata for XML processing
have been considered [12, 14]. However, implementations of such
automata cannot compete with state-of-the-art XPath engines. The
reasons for this deficiency are that (1) performance depends on the
speed of the firstChild and nextSibling operations in the XML tree
data structure, (2) the automaton needs to visit every node of $D$,
and (3) the compilation into TDTA only works for a very restricted
subset of Core XPath.

To address (1), many implementations use in-memory pointer
structures. However, this blows up the memory requirement by a
factor of 5-10 over the size of the original XML document. Hence,
such implementations can only work over small documents. We solve
this problem by using state-of-the-art succinct trees [18], a recent
development in data structures.

Solutions to problems (2) and (3) are the main subject of this
paper. To address (2), we study ways to restrict the nodes of the
document which must be visited by the run function of the automaton.
This gives rise to the notion of relevant nodes, one of our key
contributions. To address (3), we work with non-deterministic
alternating tree automata and carefully develop on-the-fly
determinization and alternation elimination algorithms. This allows
us to retain most of the benefits of deterministic automata while
increasing the expressive power to full Core XPath. Altogether, our
implementation of these solutions to (1)-(3) provides XPath
execution speed competitive with the best known engines [1]. While
we restrict ourselves for didactic reasons to a fragment of Core
XPath, our prototype "SXSI" implements Core XPath plus text
predicates [1]; we are currently adding other XPath 1.0 features
such as number functions and aggregates.

|D|-Optimization using Relevant Nodes. Consider the query
$Q_0$ = //a//b which selects all b-descendants of a-labeled nodes.
A TDTA for this query starts at the root in a state $q_0$. When it
encounters an a-node it changes to a state $q_1$. Any b-node
encountered in $q_1$ is selected as a result. For such an automaton
we say that a node is relevant whenever the automaton changes state
at that node or selects it. Thus, all top-most a-nodes and all their
b-labeled descendants are relevant. Note that for this query, one
could use the staircase join [9] to restrict the set of all a-nodes
to the top-most ones, and only then select b-descendants; in this
way only the relevant b-nodes are touched (but some non-relevant
a-nodes might be touched in the first step). Here, we first give an
algorithm that executes an arbitrary TDTA so that only relevant
nodes are visited. This is achieved by executing the automaton over
an index that allows, at any node, to "jump" to the next
$\sigma$-labeled descendant or to the next $\sigma$-labeled
following node (according to XPath), for any label $\sigma$. For
bottom-up deterministic tree automata (BDTA), we can define relevant
nodes in a similar way. We sketch an algorithm for BDTAs that only
touches relevant nodes, given an index that allows access to all
bottom-most nodes with a given label and allows jumping to labeled
ancestors. Due to space constraints, and because the bottom-up
algorithm has to handle more cases than the top-down one to ensure
that nodes are only visited once, we do not give it fully in this
paper.

Given a query, it is not always possible to determine which of the
bottom-up or top-down evaluations is the most efficient (i.e.,
visits fewer nodes). For instance, for query $Q_0$, if the input
document has fewer b-nodes than a-nodes, a bottom-up traversal seems
more efficient. Following this idea, we extend our evaluation
algorithm to support Start Anywhere Runs: for a query such as
//a//b//c, if the global count of b-nodes is low, we can jump to
these b-nodes, and from there execute simultaneously a bottom-up run
which checks for a-nodes and a top-down run which selects c-nodes.

Non-Deterministic Automata. To determine the relevant nodes for a
TDTA or BDTA, we actually first have to minimize the automaton.
Intuitively, a non-minimal automaton can perform many useless state
changes. While minimization can be done efficiently for
deterministic automata, it poses a big problem for non-deterministic
automata. Here, minimization is EXPTIME-complete, and a unique
minimal automaton need not even exist. Unfortunately, for XPath we
must deal with non-deterministic automata: consider
$Q_1$ = //a[.//b]//c. If we execute it top-down and are below an
a-node, then for a c-node we cannot know whether to select it (this
depends on the presence of b-nodes which might be below). Similarly,
the query //a//c cannot be evaluated in a deterministic bottom-up
way. There is an elegant way to characterize relevant nodes for
non-deterministic automata, using equivalence between sub-automata.

This notion proves too complex to implement in practice (equiv- 
alence is EXPTIME-complete), but we give an on-the-fly algorithm 
which soundly approximates the relevant nodes of a nondetermi- 
nistic tree automaton, while evaluating the automaton on an input 
tree. Our experiments show that for typical XPath queries our on- 
the-fly algorithms perform well: the approximation of the set of 
relevant nodes that we compute is close to the real set allowing us 
to only visit a small fraction of the complete document. 

Plan. Section 2 gives the definitions and introduces our model of
selecting tree automata. Section 3 formally defines the concept of
relevant nodes and studies two optimal algorithms for minimal
top-down and bottom-up selecting tree automata. Section 4 introduces
our variant of alternating tree automata, their encoding of XPath
queries, and presents the approximating algorithm as well as a
collection of implementation techniques. The impact of these
techniques is validated by experiments given in Section 5. Some
non-crucial aspects are detailed in the Appendix.

Related Work 

Skipping of complete subtrees has been considered before, in several
different contexts. For instance, the application of the staircase
join [9] can be seen as an instance of skipping: for the descendant
axis, only the top-most independent context nodes are considered,
i.e., their subtrees are skipped; in a similar way, even ancestor
paths can be skipped by this join. Skipping of subtrees is also
common practice in advanced compilers for pattern matching in
programming languages. In [11], selecting tree automata are compiled
into mutually recursive functions of an ML-style target language.
The authors define "loop breaker" states: intuitively, a state with
a transition $q, l \to (q, q)$. This is similar to non-relevant
nodes, according to our definition, and is used there to enforce the
termination of the generated code. There is a large body of work on
optimizations for the evaluation of attribute grammars (see, e.g.,
[15]), some of which correspond to skipping of subtrees; note that
attribute grammars can simulate selecting TDTAs and BDTAs. In [5],
automata are used for tree pattern matching and subtrees are skipped
according to type information. Tree automata have been used for
XPath, but mainly in the context of streaming: Koch [10] runs BDTAs
over a reversed XML document, followed by a top-down run, to
evaluate XPath. Suciu et al. [8] use automata to evaluate many
queries in parallel, over a stream. We are not aware of any work
that executes automata over tree indexes, as we do. In fact, even
for usual DFAs over strings, there is no prior work on executing
DFAs or evaluating regular expressions over indexed strings (where
the index allows skipping regions of the string, based on labels);
the closest work is [2]. Also comparable is the idea of running DFAs
on grammar-compressed strings. The THOR system [16, 17] uses data
structures that support the same jumping operations as we do.
However, it performs step-wise evaluation of XPath à la Koch and
therefore cannot use these structures to restrict evaluation to only
relevant nodes.

2. SELECTING TREE AUTOMATA 

We define our notion of tree automata over binary trees. When
applying them to XML we use the well-known "first-child/next-sibling"
encoding: the first child of a node in the XML tree becomes the left
child in the binary tree, and the next sibling of a node in the XML
tree becomes the right child in the binary tree. We also do not
consider text nodes or attributes (but a straightforward encoding is
given in [1]). Let $\Sigma$ be an alphabet, i.e., a finite set of
symbols. The set of binary trees over $\Sigma$, denoted $T(\Sigma)$,
is the smallest set $T$ such that (i) the leaf symbol $\#$ is in $T$
and (ii) if $t_1, t_2 \in T$ and $l \in \Sigma$, then $l(t_1, t_2)$
is in $T$. In the examples, we will often omit $\#$ for concision. A
node is a finite (possibly empty) sequence over $\{1, 2\}$. For a
given tree $t \in T(\Sigma)$ its set of nodes, denoted
$\mathrm{Dom}(t)$, is the smallest finite set such that (i) the
empty sequence $\varepsilon$ is in $\mathrm{Dom}(t)$ and (ii) if two
sequences $\pi \cdot 1$ and $\pi \cdot 2$ are in $\mathrm{Dom}(t)$,
then $\pi \in \mathrm{Dom}(t)$. The label of the node $\pi$ in the
tree $t$ is denoted by $t(\pi)$; for $t = l(t_1, t_2)$ it is defined
as $l$ if $\pi = \varepsilon$, and as $t_i(\pi')$ if
$\pi = i \cdot \pi'$; moreover, for $t = \#$ we have
$t(\varepsilon) = \#$. As we can see, $\varepsilon$ denotes the root
node, and $\pi \cdot 1$ and $\pi \cdot 2$ denote the left and right
child of the node $\pi$, respectively. When talking about the
followings of a node $\pi$, we mean all the nodes visited after it
during a pre-order traversal that are not descendants of $\pi$.
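The first-child/next-sibling encoding and the addressing of nodes by sequences over $\{1, 2\}$ can be illustrated by the following minimal Python sketch. The class names `XmlNode` and `BinNode` and the helpers `encode` and `label_at` are hypothetical names introduced for illustration only; `None` plays the role of the leaf symbol $\#$.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class XmlNode:
    """Unranked XML element (text nodes and attributes are ignored)."""
    label: str
    children: List["XmlNode"] = field(default_factory=list)


@dataclass
class BinNode:
    """Binary tree node; a missing child (None) stands for the leaf '#'."""
    label: str
    left: Optional["BinNode"] = None    # first child in the XML tree
    right: Optional["BinNode"] = None   # next sibling in the XML tree


def encode(siblings: List[XmlNode]) -> Optional[BinNode]:
    """Encode a sequence of sibling XML nodes as one binary tree."""
    if not siblings:
        return None
    head, rest = siblings[0], siblings[1:]
    return BinNode(head.label, encode(head.children), encode(rest))


def label_at(t: Optional[BinNode], path: Tuple[int, ...]) -> str:
    """Return t(pi) for a node pi given as a sequence over {1, 2}."""
    for step in path:
        if t is None:
            raise KeyError(path)  # pi is not in Dom(t)
        t = t.left if step == 1 else t.right
    return "#" if t is None else t.label


# XML document <a><b/><c><b/></c></a>
doc = XmlNode("a", [XmlNode("b"), XmlNode("c", [XmlNode("b")])])
tree = encode([doc])
```

In this sketch the c-element, which is the second child of the root in the XML document, is reached through the path `(1, 2)`: first child of the root, then next sibling.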

Definition 2.1 A selecting tree automaton (STA) $\mathcal{A}$ is a
6-tuple $(\Sigma, Q, \mathcal{T}, \mathcal{B}, \delta, \mathcal{S})$
where $\Sigma$ is an alphabet of input symbols, $Q$ is a finite set
of states, $\mathcal{T} \subseteq Q$ is the set of top states,
$\mathcal{B} \subseteq Q$ is the set of bottom states,
$\mathcal{S} \subseteq Q \times \Sigma$ is the set of selecting
configurations, and $\delta$ is a finite set of transitions. A
transition is a tuple $(q, L, q_1, q_2)$, where $q, q_1, q_2 \in Q$
and $L$ is a non-empty subset of $\Sigma$.

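From now on we let
$\mathcal{A} = (\Sigma, Q, \mathcal{T}, \mathcal{B}, \delta, \mathcal{S})$
be a fixed (but arbitrary) automaton, unless otherwise specified. We
often write $q, L \to (q_1, q_2)$ to denote that
$(q, L, q_1, q_2) \in \delta$, and similarly
$q, L \Rightarrow (q_1, q_2)$ to denote that
$(q, L, q_1, q_2) \in \delta$ and $(q, l) \in \mathcal{S}$ for every
$l \in L$. Before defining the semantics of $\mathcal{A}$ via runs,
we fix a few useful definitions. Let $q, q_1, q_2 \in Q$ and
$l \in \Sigma$. The destination and source states, denoted
$\delta(q, l)$ and $\delta(q_1, q_2, l)$, respectively, are defined
as

$$\delta(q, l) = \{(q', q'') \mid \exists L \subseteq \Sigma \text{ s.t. } l \in L \text{ and } (q, L, q', q'') \in \delta\}$$
$$\delta(q_1, q_2, l) = \{q \mid \exists L \subseteq \Sigma \text{ s.t. } l \in L \text{ and } (q, L, q_1, q_2) \in \delta\}.$$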

An automaton $\mathcal{A}$ is a top-down deterministic selecting
tree automaton (TDSTA) if $\mathcal{T}$ is a singleton and, for
every $q \in Q$ and $l \in \Sigma$, $\delta(q, l)$ is a singleton.
Similarly, $\mathcal{A}$ is a bottom-up deterministic selecting tree
automaton (BDSTA) if $\mathcal{B}$ is a singleton and, for every
$q_1, q_2 \in Q$ and $l \in \Sigma$, $\delta(q_1, q_2, l)$ is a
singleton. Note that if $\mathcal{S}$ is empty, then a TDSTA is
exactly the same as a classical deterministic top-down tree
automaton (TDTA): the single state in $\mathcal{T}$ is the initial
state and the states in $\mathcal{B}$ are the final states;
similarly, a BDSTA is a classical deterministic bottom-up tree
automaton (BDTA): the single state in $\mathcal{B}$ is its initial
state and the states in $\mathcal{T}$ are its final states. The
semantics of an STA is given by the set of trees it recognizes (as
for usual tree automata) and by the set of nodes it selects. To
formalize these notions, we introduce the concept of run.

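Definition 2.2 (Run of an STA) Let $t \in T(\Sigma)$. A run of
$\mathcal{A}$ over $t$ is a total function
$R : \mathrm{Dom}(t) \to Q$ such that for all
$\pi \in \mathrm{Dom}(t)$ with $t(\pi) \in \Sigma$,

$$R(\pi) \in \delta(R(\pi \cdot 1), R(\pi \cdot 2), t(\pi)).$$

The run $R$ is accepting if and only if

• $R(\varepsilon) \in \mathcal{T}$

• for all $\pi \in \mathrm{Dom}(t)$ with $t(\pi) = \#$, $R(\pi) \in \mathcal{B}$.

We denote by $\mathcal{R}^{\mathcal{A}}_t$ the set of all accepting
runs of $\mathcal{A}$ over $t$.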

An STA is top-down complete if, for every $q \in Q$ and
$l \in \Sigma$, $\delta(q, l)$ is non-empty. Similarly, an STA is
bottom-up complete if, for every $q_1, q_2 \in Q$ and
$l \in \Sigma$, $\delta(q_1, q_2, l)$ is non-empty. Top-down
complete TDSTAs and bottom-up complete BDSTAs have a unique run for
any input tree $t$.

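Definition 2.3 Let $\mathcal{A}$ be an STA. The language of
$\mathcal{A}$, denoted $\mathcal{L}(\mathcal{A})$, is the set

$$\mathcal{L}(\mathcal{A}) = \{t \in T(\Sigma) \mid \mathcal{R}^{\mathcal{A}}_t \neq \emptyset\}.$$

The set of selected nodes of $\mathcal{A}$, denoted
$\mathcal{A}(t)$, is the set

$$\mathcal{A}(t) = \{\pi \in \mathrm{Dom}(t) \mid (R(\pi), t(\pi)) \in \mathcal{S} \text{ and } R \in \mathcal{R}^{\mathcal{A}}_t\}.$$

We say that two STAs $\mathcal{A}$ and $\mathcal{A}'$ are
equivalent, denoted $\mathcal{A} \equiv \mathcal{A}'$, if
$\mathcal{L}(\mathcal{A}) = \mathcal{L}(\mathcal{A}')$ and for every
$t \in T(\Sigma)$, $\mathcal{A}(t) = \mathcal{A}'(t)$.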

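Example 2.1 (STA for //a//b)

$$\mathcal{A}_{//a//b} = (\underbrace{\{a, b, c\}}_{\Sigma}, \underbrace{\{q_0, q_1\}}_{Q}, \underbrace{\{q_0\}}_{\mathcal{T}}, \underbrace{\{q_0, q_1\}}_{\mathcal{B}}, \underbrace{\{(q_1, b)\}}_{\mathcal{S}}, \delta)$$

where $\delta$ consists of the transitions

$$\begin{array}{ll}
q_0, \{a\} \to (q_1, q_0) & q_1, \{b\} \Rightarrow (q_1, q_1) \\
q_0, \Sigma \setminus \{a\} \to (q_0, q_0) & q_1, \Sigma \setminus \{b\} \to (q_1, q_1)
\end{array}$$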
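As an illustration, the following Python sketch runs the TDSTA of Example 2.1 top-down over a binary tree and collects the selected nodes in document order. Trees are written as nested tuples `(label, left, right)` with `None` for $\#$; this representation and the names `delta`, `SELECTING`, `BOTTOM` and `run_topdown` are assumptions made for the sketch, not part of the formal model.

```python
Q0, Q1 = "q0", "q1"


def delta(q, label):
    """Unique destination pair (q_left, q_right) of the TDSTA for //a//b."""
    if q == Q0:
        return (Q1, Q0) if label == "a" else (Q0, Q0)
    return (Q1, Q1)  # q1 loops on every label


SELECTING = {(Q1, "b")}   # selecting configurations
BOTTOM = {Q0, Q1}         # bottom states


def run_topdown(tree, q0=Q0):
    """Return the selected nodes (as paths over {1, 2}) in document order.

    Returns an empty list if the unique run is not accepting.
    """
    selected = []

    def visit(node, path, q):
        if node is None:              # '#' leaf: must be in a bottom state
            return q in BOTTOM
        label, left, right = node
        if (q, label) in SELECTING:   # one look-up, one constant-time append
            selected.append(path)
        q_left, q_right = delta(q, label)
        return (visit(left, path + (1,), q_left)
                and visit(right, path + (2,), q_right))

    return selected if visit(tree, (), q0) else []


# XML document <c><a><b/></a><b/></c> in first-child/next-sibling encoding
example_tree = ("c", ("a", ("b", None, None), ("b", None, None)), None)
```

On `example_tree` only the b-node below the a-node, at path `(1, 1)`, is selected; the second b-node is a sibling of the a-node and is reached in state $q_0$.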

The TDSTA $\mathcal{A}_{//a//b}$ of Example 2.1 is not deterministic
bottom-up. This is because its set $\mathcal{B}$ of bottom states is
not a singleton. In fact, we claim that there does not exist any
BDSTA that is equivalent to $\mathcal{A}_{//a//b}$, i.e., which
selects the same nodes. Intuitively, when a bottom-up automaton sees
a b-node, it does not know whether this node should be selected or
not (this depends on the existence of an a-labeled ancestor). We
claim similarly that there exist BDSTAs for which there is no
equivalent TDSTA. The automaton implementing the query //a[.//b] is
such an example (which we detail in Appendix A). To conclude with
the formal definitions, we characterize several kinds of states that
we use in the following sections.



Definition 2.4 Let $\mathcal{A}$ be an STA. A state $q \in Q$ is
non-changing if and only if
$\forall l \in \Sigma$, $\delta(q, l) = \{(q, q)\}$. For a
non-changing state $q$: if $q \in \mathcal{B}$, $q$ is a top-down
universal state; if $q \in \mathcal{T}$, $q$ is a bottom-up
universal state; if $q \notin \mathcal{B}$, $q$ is a top-down sink
state; if $q \notin \mathcal{T}$, $q$ is a bottom-up sink state.

Minimal Selecting Tree Automata. In Appendix A.2 it is shown that
for every TDSTA (resp. BDSTA) there is a unique minimal one, where
minimal means with the smallest number of states. For a minimal
TDSTA $\mathcal{A}$: (i) at most one state is a top-down universal
state and (ii) at most one state is a top-down sink state. If any of
these states exist, then we denote them by $q_\top$ and $q_\bot$,
respectively. Similar properties hold for BDSTAs. Another property
that will be important for us in the next section is that, in a
minimal TDSTA or BDSTA, if a state $q$ is not in
$\{q_\top, q_\bot\}$, then there must exist a label $l$ such that
$\delta(q, l)$ contains a pair different from $(q, q)$. We say that
$l$ is an essential label for $q$ (in $\mathcal{A}$).
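For a TDSTA given as a transition table, non-changing states and essential labels can be read off directly. The sketch below does this for the automaton $\mathcal{A}_{//a//b}$ of Example 2.1; the dictionary encoding `DELTA` and the helper names are assumptions made for illustration.

```python
SIGMA = {"a", "b", "c"}

# Transition table of the TDSTA for //a//b: (state, label) -> (left, right)
DELTA = {
    ("q0", "a"): ("q1", "q0"),
    ("q0", "b"): ("q0", "q0"),
    ("q0", "c"): ("q0", "q0"),
    ("q1", "a"): ("q1", "q1"),
    ("q1", "b"): ("q1", "q1"),
    ("q1", "c"): ("q1", "q1"),
}


def essential_labels(q):
    """Labels l for which delta(q, l) differs from the loop (q, q)."""
    return {l for l in SIGMA if DELTA[(q, l)] != (q, q)}


def is_non_changing(q):
    """A state is non-changing iff it loops on every label."""
    return not essential_labels(q)
```

Here `essential_labels("q0")` is `{"a"}`, while `q1` loops on every label.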

3. RELEVANT NODES 

As we have explained in the Introduction, our goal is to improve
query answering time by reducing the number of nodes that have to be
visited by the evaluation function. A common optimization technique
for tree automata (especially used in pattern matching and type
checking) is to avoid visiting a subtree. For instance, consider the
simple DTD "<!ELEMENT a ANY>" which states that an input document
must have an a-labeled root node and any well-formed content below
it. A recognizer automaton which checks the validity of a tree
against this DTD is

$$\mathcal{A} = (\Sigma, \underbrace{\{q_0, q_\top, q_\bot\}}_{Q}, \underbrace{\{q_0\}}_{\mathcal{T}}, \underbrace{\{q_\top\}}_{\mathcal{B}}, \underbrace{\emptyset}_{\mathcal{S}}, \delta)$$

where $\delta$ consists of the transitions

$$\begin{array}{ll}
q_0, \{a\} \to (q_\top, q_\top) & q_\top, \Sigma \to (q_\top, q_\top) \\
q_0, \Sigma \setminus \{a\} \to (q_\bot, q_\bot) & q_\bot, \Sigma \to (q_\bot, q_\bot)
\end{array}$$

Since the automaton only changes state at the root node, only this
node is "relevant"; no information is gained at any other node. A
clever evaluator may skip all non-relevant subtrees. As we can see,
whenever the automaton enters a non-changing state, we can skip the
current subtree. Of course, there are automata equivalent to the one
above which change state in the subtrees under the root node (even
though this is not "required"). How can we make sure that our
automaton only changes state when this is really necessary? The
answer is simple: we minimize the automaton. If the minimal
automaton changes state, then any other automaton for the query does
too; thus it uniquely determines the relevant nodes. Moreover, as
mentioned after Definition 2.4, the minimal automaton has at most
one state $q_\top$ and one state $q_\bot$. It is therefore easy to
determine when a subtree can be skipped. Of course, in a selecting
tree automaton, all selected nodes must be relevant, because we
cannot select them without visiting them. Consequently, given a
TDSTA $\mathcal{A}$ and a tree $t$, we say that a node $\pi$ of $t$
is relevant if the minimal automaton $\mathcal{A}_{\min}$ of
$\mathcal{A}$ changes state at $\pi$ or selects $\pi$. We now give a
general definition that can be used for non-deterministic automata;
instead of minimality, the definition uses equivalence between
sub-automata.

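Definition 3.1 (Relevant nodes) Let $\mathcal{A}$ be an STA. Let
$t \in T(\Sigma)$ and $R \in \mathcal{R}^{\mathcal{A}}_t$. Let
$\pi \in \mathrm{Dom}(t)$ such that $\pi \cdot 1 \in \mathrm{Dom}(t)$
and $\pi \cdot 2 \in \mathrm{Dom}(t)$. The node $\pi$ is relevant
for the run $R$ if and only if either
$(R(\pi), t(\pi)) \in \mathcal{S}$ or none of the following hold:

• $\mathcal{A}[R(\pi)] \equiv \mathcal{A}[R(\pi \cdot 1)] \equiv \mathcal{A}[R(\pi \cdot 2)]$;

• $\mathcal{A}[R(\pi)] \equiv \mathcal{A}[R(\pi \cdot 1)]$ and $\mathcal{A}[R(\pi \cdot 2)] \equiv \mathcal{A}_\top$;

• $\mathcal{A}[R(\pi)] \equiv \mathcal{A}[R(\pi \cdot 2)]$ and $\mathcal{A}[R(\pi \cdot 1)] \equiv \mathcal{A}_\top$;

where $\mathcal{A}_\top$ is such that
$\mathcal{L}(\mathcal{A}_\top) = T(\Sigma)$ and for all
$t \in T(\Sigma)$, $\mathcal{A}_\top(t) = \emptyset$.
$\mathcal{A}[q]$ denotes the restriction of $\mathcal{A}$ to $q$
(i.e., where $\mathcal{T}$ is replaced by $\{q\}$) and is formally
defined in Appendix A.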



This definition generalizes the intuition we gave earlier. First, a
selected node is relevant. Then, a node can be skipped (i.e., is not
relevant) if the automaton performs the same computation on the node
and on both its children (informally, the automaton "loops" both on
the left and right child). A node can also be skipped if the
automaton loops on the left child and "ignores" the right child,
i.e., is in a state that accepts $T(\Sigma)$ and does not mark any
node. Symmetrically, a node can be skipped if the automaton loops on
the right child and ignores the left one. While Definition 3.1 gives
a proper semantic characterization of relevant nodes, we cannot use
it to derive an efficient evaluation procedure for STAs since:

(i) it requires the accepting run to be known, while we want to
deduce relevant nodes while computing the run;

(ii) it checks for equivalence of sub-STAs, an EXPTIME-complete
problem, even for recognizers.

We present two exact algorithms for particular STAs, namely min- 
imal TDSTAs and minimal BDSTAs, and show how a particular 
index can be used to skip not only subtrees but also internal nodes. 

3.1 Deterministic Top-Down Evaluation 

3.1.1 Top-down Relevance 

As we have explained, testing the relevance of a node in the
accepting run of an automaton $\mathcal{A}$ consists in checking the
equivalence of several sub-automata. It is possible to perform this
check efficiently for minimal TDSTAs. Indeed, in a minimal TDSTA,
$q$ recognizes $T(\Sigma)$ if and only if $q$ is a top-down
universal state. More generally, given two states $q$ and $q'$ of
$\mathcal{A}$:

$$\mathcal{A}[q] \equiv \mathcal{A}[q'] \iff q = q'.$$

This is a consequence of the definition of a minimal automaton.
Given a TDSTA and a run, we can easily characterize the set of
relevant nodes:

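Lemma 3.1 (Top-down relevant nodes) Let $\mathcal{A}$ be a minimal
top-down complete TDSTA, $t \in T(\Sigma)$,
$R \in \mathcal{R}^{\mathcal{A}}_t$ and $\pi \in \mathrm{Dom}(t)$
such that $\pi \cdot 1 \in \mathrm{Dom}(t)$ and
$\pi \cdot 2 \in \mathrm{Dom}(t)$. The node $\pi$ is top-down
relevant in $R$ if and only if either
$(R(\pi), t(\pi)) \in \mathcal{S}$ or none of the following hold:

• $R(\pi) = R(\pi \cdot 1) = R(\pi \cdot 2)$

• $R(\pi) = R(\pi \cdot 1)$ and $R(\pi \cdot 2) = q_\top$

• $R(\pi) = R(\pi \cdot 2)$ and $R(\pi \cdot 1) = q_\top$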

For a given run of a minimal TDSTA, the relevant nodes are either
the selected nodes or nodes for which a state change occurs. An
important observation is that for TDSTAs, a state change is exactly
determined by the set of essential labels. For instance, in the
automaton $\mathcal{A}_{//a//b}$ of Example 2.1, the set of
essential labels for state $q_0$ is $\{a\}$: the automaton changes
state only if it encounters an a-labeled node during the top-down
run.
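The state-based test of Lemma 3.1 can be checked mechanically once the full run is known. The following Python sketch computes the complete run of a TDSTA and then filters the top-down relevant nodes. Trees are nested tuples `(label, left, right)` with `None` for $\#$; this encoding, the function names, and the optional parameter `q_top` (the state $q_\top$, if the automaton has one) are assumptions of the sketch, which also assumes that the run is accepting.

```python
def full_run(tree, delta, q0):
    """Map every node (path over {1, 2}) to (state, label); '#' for leaves."""
    run = {}

    def visit(node, path, q):
        if node is None:
            run[path] = (q, "#")
            return
        label, left, right = node
        run[path] = (q, label)
        q_left, q_right = delta(q, label)
        visit(left, path + (1,), q_left)
        visit(right, path + (2,), q_right)

    visit(tree, (), q0)
    return run


def relevant_nodes(tree, delta, selecting, q0, q_top=None):
    """Top-down relevant nodes according to the three conditions of Lemma 3.1."""
    run = full_run(tree, delta, q0)
    relevant = []
    for path in sorted(run):              # tuple order is document order
        q, label = run[path]
        if label == "#":                  # only nodes with two children in Dom(t)
            continue
        q_left = run[path + (1,)][0]
        q_right = run[path + (2,)][0]
        skippable = (
            q == q_left == q_right
            or (q == q_left and q_right == q_top)
            or (q == q_right and q_left == q_top)
        )
        if (q, label) in selecting or not skippable:
            relevant.append(path)
    return relevant


def delta_ab(q, label):
    """Transitions of the TDSTA for //a//b (Example 2.1)."""
    if q == "q0":
        return ("q1", "q0") if label == "a" else ("q0", "q0")
    return ("q1", "q1")


# XML document <c><a><b/></a><b/></c>
example_tree = ("c", ("a", ("b", None, None), ("b", None, None)), None)
```

For `example_tree`, the relevant nodes are the a-node at `(1,)`, where the state changes from $q_0$ to $q_1$, and the selected b-node at `(1, 1)`; the root and the second b-node are not relevant.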

3.1.2 Top-Down Jumping Functions 

Based on this observation, we define particular jumping functions in
a tree which extend the basic firstChild and nextSibling moves. The
implementation of such functions using state-of-the-art tree indexes
is discussed later, in Section 5.

Definition 3.2 (Top-down jumping functions) Let $t$ be a tree in
$T(\Sigma)$. We define the functions $d_t$, $f_t$, $l_t$, $r_t$ as:

• $d_t : \mathrm{Dom}(t) \times 2^{\Sigma} \to \mathrm{Dom}(t) \cup \{\Omega\}$,
where $d_t(\pi, L)$ returns the first descendant $\pi'$ of $\pi$ (in
document order) such that $t(\pi') \in L$;

• $f_t : \mathrm{Dom}(t) \times 2^{\Sigma} \times \mathrm{Dom}(t) \to \mathrm{Dom}(t) \cup \{\Omega\}$,
where $f_t(\pi, L, \pi_0)$ returns the first following node $\pi'$
of $\pi$ such that $t(\pi') \in L$ and $\pi'$ is a descendant of
$\pi_0$;

• $l_t : \mathrm{Dom}(t) \times 2^{\Sigma} \to \mathrm{Dom}(t) \cup \{\Omega\}$,
where $l_t(\pi, L)$ returns the first descendant $\pi'$ of $\pi$
whose label is in $L$ and such that
$\pi' = \pi \cdot 1 \cdots 1$ (left-most path);

• $r_t : \mathrm{Dom}(t) \times 2^{\Sigma} \to \mathrm{Dom}(t) \cup \{\Omega\}$,
where $r_t(\pi, L)$ returns the first descendant $\pi'$ of $\pi$
whose label is in $L$ and such that
$\pi' = \pi \cdot 2 \cdots 2$ (right-most path).

All these functions return a special error node $\Omega$ if there is
no $\pi' \in \mathrm{Dom}(t)$ which fits their definitions.

Using these functions, the sequence of top-most nodes
$\pi_0, \ldots, \pi_n$ whose labels are in $L$, in a subtree rooted
at $\pi$, can be computed by $\pi_0 = d_t(\pi, L)$ and then
$\pi_{i+1} = f_t(\pi_i, L, \pi)$, until $\pi_i = \Omega$.
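The following Python sketch gives a naive reference implementation of $d_t$ and $f_t$ and of the loop that enumerates top-most nodes. It scans the tree in pre-order and is meant only to make the definitions concrete; an actual index would answer these jumps without scanning. Trees are nested tuples `(label, left, right)` with `None` for $\#$, descendants are taken in the binary tree and are proper (the start node itself is excluded), and the names `OMEGA`, `d`, `f` and `topmost` are assumptions of the sketch.

```python
OMEGA = None  # stands for the error node


def preorder(tree, path=()):
    """Yield (path, label) for all labeled nodes, in document order."""
    if tree is None:
        return
    label, left, right = tree
    yield path, label
    yield from preorder(left, path + (1,))
    yield from preorder(right, path + (2,))


def is_descendant(p, ancestor):
    """True if p is a proper descendant of ancestor in the binary tree."""
    return len(p) > len(ancestor) and p[:len(ancestor)] == ancestor


def d(tree, pi, labels):
    """First descendant of pi, in document order, whose label is in labels."""
    for p, label in preorder(tree):
        if is_descendant(p, pi) and label in labels:
            return p
    return OMEGA


def f(tree, pi, labels, pi0):
    """First following node of pi with label in labels that is below pi0."""
    seen_pi = False
    for p, label in preorder(tree):
        if p == pi:
            seen_pi = True
        elif (seen_pi and not is_descendant(p, pi)
              and is_descendant(p, pi0) and label in labels):
            return p
    return OMEGA


def topmost(tree, pi, labels):
    """Top-most nodes below pi whose label is in labels, via d then f."""
    result = []
    p = d(tree, pi, labels)
    while p is not OMEGA:
        result.append(p)
        p = f(tree, p, labels, pi)
    return result


example_tree = ("r", ("x", ("a", ("a", None, None), None), ("a", None, None)), None)
```

In `example_tree`, the a-node at `(1, 1, 1)` lies below another a-node and is therefore not reported: `topmost(example_tree, (), {"a"})` yields the paths `(1, 1)` and `(1, 2)`.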

3.1.3 Jumping Top-Down Algorithm 

We use the jumping functions defined in the previous section to
compute a partial run for a minimal TDSTA and an input tree $t$.
More specifically, the algorithm returns a mapping from nodes to
states. If there is no accepting run, the algorithm aborts and
returns an empty mapping. We describe the algorithm informally (its
pseudo-code is given in Appendix B.1). The algorithm is implemented
by means of a recursive function topdown_jump which takes as
arguments a node $\pi$ in the input tree $t$ and a state $q$
(initially the root node $\varepsilon$ and the initial state $q_0$
of the TDSTA). This function works like the usual top-down
evaluation procedure for a TDSTA. First, if $\pi$ is a leaf (a
$\#$-labeled node in our context), then the function checks whether
$q \in \mathcal{B}$. If this is the case, the function returns the
mapping $\{\pi \mapsto q\}$; otherwise it fails. More interestingly,
if $\pi$ is not a leaf, then the function computes the states
$(q_1, q_2) = \delta(q, t(\pi))$. If either $q_1$ or $q_2$ is the
sink state $q_\bot$, then the function fails (there is no accepting
run). Otherwise, the function performs a case analysis on $q_i$ to
determine the set of top-most relevant nodes in the subtree rooted
at $\pi \cdot i$ (for $i \in \{1, 2\}$). The function considers the
three cases given in Lemma 3.1:

• $q_i, L' \to (q_i, q_i)$ and $q_i, L \to (q', q'')$ with $q'$ or
$q''$ distinct from $q_i$. The function performs its recursion on
all the top-most descendants of $\pi \cdot i$ whose label is in $L$;

• $q_i, L' \to (q_i, q_\top)$ and $q_i, L \to (q', q'')$ with $q'$
distinct from $q_i$. The function is called recursively on the node
$l_t(\pi \cdot i, L)$ (the automaton loops on the left-most path
below the current node);

• $q_i, L' \to (q_\top, q_i)$ and $q_i, L \to (q', q'')$ with $q''$
distinct from $q_i$. The function is called recursively on the node
$r_t(\pi \cdot i, L)$.

If none of the above hold, $\pi \cdot i$ is relevant and the
function is recursively called on $\pi \cdot i$ itself. Lastly, the
function returns the mapping $\{\pi \mapsto q\}$ augmented by the
mappings returned by the recursive calls on the left and right
subtrees. This function computes the optimal traversal with respect
to relevant nodes:

Theorem 3.1 Let $t \in T(\Sigma)$. Let $\mathcal{A}$ be a minimal
TDSTA. Let $R$ be the run of $\mathcal{A}$ over $t$ and
$R' = \mathrm{topdown\_jump}(t, \mathcal{A})$.

• if $R$ is an accepting run, then for all $\pi \in \mathrm{Dom}(t)$,
$R'(\pi) = R(\pi)$ if and only if $\pi$ is top-down relevant for $R$;

• if $R$ is not an accepting run, then $R' = \emptyset$.
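The idea behind the jumping evaluation can be illustrated by a deliberately simplified Python sketch. It is not the algorithm of Appendix B.1: it only handles the first of the three cases (the state loops on both children), it assumes that every state is a bottom state so that the run is always accepting, and it treats selecting labels as labels at which the automaton must stop. The helper `topmost` stands in for the index jumps $d_t$ and $f_t$ and scans the tree, which a real index would avoid. All names and the tuple encoding `(label, left, right)` of trees are assumptions.

```python
def topmost(tree, path, labels):
    """Top-most nodes of the subtree at `path` (root included) with label in labels.

    Stand-in for the jumps d_t / f_t; yields (path, subtree) in document order.
    """
    if tree is None:
        return
    label, left, right = tree
    if label in labels:
        yield path, tree
    else:
        yield from topmost(left, path + (1,), labels)
        yield from topmost(right, path + (2,), labels)


def topdown_jump(tree, sigma, delta, selecting, q0):
    """Return (visited, selected): nodes touched by the automaton, nodes selected."""
    visited, selected = [], []

    def stop_labels(q):
        # Labels on which q changes state or selects the current node.
        return {l for l in sigma if delta(q, l) != (q, q) or (q, l) in selecting}

    def go(subtree, path, q):
        labels = stop_labels(q)
        if not labels:                 # q loops forever and never selects: skip
            return
        for p, (label, left, right) in topmost(subtree, path, labels):
            visited.append(p)
            if (q, label) in selecting:
                selected.append(p)
            q_left, q_right = delta(q, label)
            go(left, p + (1,), q_left)
            go(right, p + (2,), q_right)

    go(tree, (), q0)
    return visited, selected


def delta_ab(q, label):
    """Transitions of the TDSTA for //a//b (Example 2.1)."""
    if q == "q0":
        return ("q1", "q0") if label == "a" else ("q0", "q0")
    return ("q1", "q1")


# XML document <c><a><b/></a><b/></c>
example_tree = ("c", ("a", ("b", None, None), ("b", None, None)), None)
```

On `example_tree` the automaton touches only the a-node at `(1,)` and the b-node at `(1, 1)`; the root and the second b-node are never handed to the transition function.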

3.2 Deterministic Bottom-Up Evaluation 

While a top-down run of an automaton can be translated into a
natural top-down tree traversal, bottom-up runs are more
complicated. Assuming that a parent move and access to the sequence
of leaves of an input tree are supported, we can devise a "pure
bottom-up" evaluation function, which starts from the sequence of
leaves and works its way up to the root. The pseudo-code of this
algorithm is given in Appendix B.2. From the sequence
$(\pi_1, q_0), \ldots, (\pi_n, q_0)$ of all leaves $\pi_i$ and
initial state $q_0$, the algorithm proceeds to "reduce" them (by
replacing two siblings by their parent and corresponding state)
until the root node is obtained. If the first two nodes in the
current list are not siblings, the algorithm first reduces
recursively the tail of the list, pushes back the first element on
the reduced tail (whose size decreased) and reduces the new list.
For BDSTAs, relevance is once again defined in terms of state
change, but in a more complex way.

Lemma 3.2 (Bottom-up relevant nodes) Let $\mathcal{A}$ be a
bottom-up complete minimal BDSTA. Let $\mathcal{B} = \{q_0\}$. Let
$t$ be a tree. Let $R$ be the accepting run for $\mathcal{A}$ and
$t$ (if it exists). Let $\pi \in \mathrm{Dom}(t)$ such that
$\pi \cdot 1 \in \mathrm{Dom}(t)$ and
$\pi \cdot 2 \in \mathrm{Dom}(t)$. The node $\pi$ is relevant if and
only if $(R(\pi), t(\pi)) \in \mathcal{S}$ or none of the following
conditions holds:

• $R(\pi) = q_\top$;

• $R(\pi) = R(\pi \cdot 1) = R(\pi \cdot 2)$;

• $R(\pi) = R(\pi \cdot 1)$ and $R(\pi \cdot 2) \in \{q_0, q_\top\}$;

• $R(\pi) = R(\pi \cdot 2)$ and $R(\pi \cdot 1) \in \{q_0, q_\top\}$.

We do not give the proof that these conditions on states coincide
with the relevance of nodes as given by Definition 3.1, but
illustrate them by an example given in Appendix B.2.

In the same way we generalized firstChild to $d_t$ and $l_t$ and
nextSibling to $f_t$ and $r_t$ for the top-down case, the moves used
in the bottom-up algorithm can be generalized. The sequence of all
leaves is replaced by the sequence of bottom-most nodes with a
particular label, and the parent move can be replaced by either a
jump to an ancestor with a particular label, or the restriction of
this jump to the left-most or right-most path leading to the current
node. Also, testing whether two nodes are siblings is generalized
into getting the common ancestor of two nodes. We dub the
generalized bottom-up jumping algorithm bottomup_jump. First, the
many cases it handles (intuitively, when trying to jump above two
nodes $\pi_1$ and $\pi_2$ we must not jump above their common
ancestor, or we could miss some nodes) make its presentation verbose
even in the form of pseudo-code. Second, the tree indexes that we
use in our implementation do not implement the ancestor jumps
efficiently (they amount to a sequence of parent calls). We
therefore limit ourselves to stating the existence of the algorithm
bottomup_jump, and give its theoretical properties:

Theorem 3.2 Let $t \in T(\Sigma)$. Let $\mathcal{A}$ be a minimal
BDSTA. Let $R$ be the run of $\mathcal{A}$ over $t$ and
$R' = \mathrm{bottomup\_jump}(t, \mathcal{A})$.

• if $R$ is an accepting run, then for all $\pi \in \mathrm{Dom}(t)$,
$R'(\pi) = R(\pi)$ if and only if $\pi$ is bottom-up relevant for $R$;

• if $R$ is not an accepting run, then $R' = \emptyset$.
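The "pure bottom-up" reduction described at the beginning of this section (without jumping) can be sketched in Python as a shift-reduce loop over the sequence of leaves: a leaf is pushed with the initial state, and whenever the two top entries of the stack are siblings they are replaced by their parent. This is a stack-based variant written for illustration, not the pseudo-code of Appendix B.2. The three-state BDSTA used as an example is a hypothetical automaton for //a[.//b] constructed for this sketch; it is not claimed to be the one detailed in the Appendix. Trees are nested tuples `(label, left, right)` with `None` for $\#$.

```python
def leaves(tree, path=()):
    """Yield the paths of all '#' leaves in document order."""
    if tree is None:
        yield path
    else:
        _, left, right = tree
        yield from leaves(left, path + (1,))
        yield from leaves(right, path + (2,))


def label_at(tree, path):
    """Label of the (non-leaf) node at `path`."""
    for step in path:
        tree = tree[step]          # index 1 = left child, index 2 = right child
    return tree[0]


def bottom_up(tree, delta, selecting, q_init):
    """Return (state at the root, selected nodes in document order)."""
    stack, selected = [], []
    for leaf in leaves(tree):
        stack.append((leaf, q_init))
        # Reduce while the two top entries are a left and a right sibling.
        while (len(stack) >= 2
               and stack[-1][0][-1:] == (2,)
               and stack[-2][0] == stack[-1][0][:-1] + (1,)):
            _, q_right = stack.pop()
            left_path, q_left = stack.pop()
            parent = left_path[:-1]
            label = label_at(tree, parent)
            q = delta(q_left, q_right, label)
            if (q, label) in selecting:
                selected.append(parent)
            stack.append((parent, q))
    return stack[0][1], sorted(selected)


def delta_pred(q_left, q_right, label):
    """Hypothetical BDSTA for //a[.//b].

    q0: no b seen; q1: a b is present, but not below the first child;
    q2: a b occurs in the left binary subtree (an XML descendant).
    """
    if q_left != "q0":
        return "q2"
    if label == "b" or q_right != "q0":
        return "q1"
    return "q0"


# XML document <r><a><b/></a><a><c/></a></r>
example_tree = ("r", ("a", ("b", None, None), ("a", ("c", None, None), None)), None)
```

With the selecting configuration `("q2", "a")`, only the first a-element, at path `(1,)`, is selected on `example_tree`.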

4. AUTOMATA FOR XPATH 

We present in this section our compilation target for XPath ex- 
pressions, namely alternating selecting tree automata (ASTA). We 
then consider a particular fragment of XPath for which we illustrate 
our compilation scheme. Afterwards we introduce a technique for 
evaluating an ASTA in a jumping fashion, using a sound approxi- 
mation of the sets of relevant nodes of the query. We also present 
various implementation techniques to further improve the complex- 
ity in practice of the evaluation of ASTAs. 

4.1 Alternating Selecting Tree Automata 

We introduce a compact variation of STAs which works with 
Boolean formulas over states. 

Definition 4.1 (Alternating Selecting Tree Automata (ASTA))
An ASTA $\mathcal{A}$ is a tuple $(\Sigma, Q, \mathcal{T}, \delta)$,
where $\Sigma$ is the alphabet of input symbols, $Q$ is the finite
set of states, $\mathcal{T} \subseteq Q$ is the set of top states,
and $\delta$ is a set of tuples $(q, L, \tau, \varphi)$, called
transitions, where $q \in Q$, $L \subseteq \Sigma$,
$\tau \in \{\to, \Rightarrow\}$ and $\varphi$ is a Boolean formula
generated by the following EBNF:

$$\varphi ::= \top \mid \bot \mid \varphi \lor \varphi \mid \varphi \land \varphi \mid \lnot \varphi \mid \downarrow_1 q \mid \downarrow_2 q \qquad (q \in Q)$$

The semantics of such automata combines the rules for a classical
alternating automaton with the rules of a selecting tree automaton.
The complete rules for the evaluation of formulas and the selection
of nodes are given in Appendix C.

4.2 From XPath to Automata 

The fragment of XPath we consider in this presentation is the
forward fragment of Core XPath, containing the descendant and child
axes as well as arbitrarily nested predicates using the or, and and
not Boolean connectives over path expressions. The full EBNF
description of this fragment is given in Appendix C. We illustrate
how to compile an XPath expression of this fragment into an ASTA.



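Example 4.1 (ASTA for the query //a//b[c]) Let

$$\mathcal{A}_{//a//b[c]} = (\Sigma, \{q_0, q_1, q_2\}, \{q_0\}, \delta)$$

where $\delta$ is:

$$\begin{array}{lll}
q_0, \{a\} \to \downarrow_1 q_1 & q_1, \{b\} \Rightarrow \downarrow_1 q_2 & q_2, \{c\} \to \top \\
q_0, \Sigma \to \downarrow_1 q_0 \lor \downarrow_2 q_0 & q_1, \Sigma \to \downarrow_1 q_1 \lor \downarrow_2 q_1 & q_2, \Sigma \to \downarrow_2 q_2
\end{array}$$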



It is easy to see with this example that such automata can be built
by a simple traversal of the parse tree of the XPath query. The
compilation scheme we follow associates one state with each step of
the query, and each state has at most two transitions. The first one
represents a "progress" from the current step to the next step (in
the XPath query). The second transition represents a recursion on
the first child, the second child or both. Note that non-determinism
is used here in an essential way. For instance, in
$\mathcal{A}_{//a//b[c]}$, in state $q_1$, if the current node is
labelled b, then the automaton selects the node if its first child
is in state $q_2$ and at the same time remains in state $q_1$ for
both the first child and the second child.

While this automaton does not seem to justify the use of alternation,
we give in Appendix C a query whose corresponding ASTA is
linear in size but whose STA (even non-deterministic) is exponentially
larger.

On this example, we observe that the particular ASTAs we consider
share many common traits with the minimal deterministic
TDSTAs of Section 3.1. First, a state change occurs whenever the
automaton gains new knowledge toward answering the query. Second,
a top-down universal state corresponds to the presence of $\top$ in
a formula (that is, $(q_\top, q_\top)$) or the absence of a $\downarrow_1$
or $\downarrow_2$ move (for instance $\downarrow_2 q$ is the counterpart
of $(q_\top, q)$ in our previous model). In such automata, a state change
has the same meaning as in a minimal deterministic one.

4.3 Bottom- Up Evaluation with Top-Down Pre- 
processing and Jumping 

Before discussing how to evaluate such automata using only rel- 
evant nodes, we give a "non-jumping" run function for ASTAs. 

Algorithm 4.1 (Evaluation of an ASTA)

Input: $\mathcal{A} = (\Sigma, Q, \mathcal{T}, \delta)$, $t$, $\pi$, $r$   Output: $\Gamma$

where $\mathcal{A}$ is the automaton, $t$ the input tree, $r$ a set of states and
$\Gamma$ is a result set. Initially $\pi = \epsilon$ and $r = \mathcal{T}$.



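[Figure 1 (diagram not reproduced): states $\{q_0\}$, $\{q_0, q_1\}$ and
$\{q_0, q_1, q_2\}$ of the top-down approximation, with the transitions
below written as $S, L \rightarrow S_1, S_2$.]

$\{q_0\}, \{a\} \rightarrow \{q_0, q_1\}, \{q_0\}$
$\{q_0\}, \Sigma \setminus \{a\} \rightarrow \{q_0\}, \{q_0\}$
$\{q_0, q_1\}, \{b\} \rightarrow \{q_0, q_1, q_2\}, \{q_0, q_1\}$
$\{q_0, q_1\}, \Sigma \setminus \{b\} \rightarrow \{q_0, q_1\}, \{q_0, q_1\}$
$\{q_0, q_1, q_2\}, \{b\} \rightarrow \{q_0, q_1, q_2\}, \{q_0, q_1, q_2\}$
$\{q_0, q_1, q_2\}, \{c\} \rightarrow \{q_0, q_1\}, \{q_0, q_1\}$
$\{q_0, q_1, q_2\}, \Sigma \setminus \{b, c\} \rightarrow \{q_0, q_1\}, \{q_0, q_1, q_2\}$

Figure 1: Top-down approximation for //a//b[c] and corresponding jumps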

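1 function eval_asta ($\mathcal{A}$, $t$, $\pi$, $r$) =
2   if $t(\pi) = \#$ then return $\emptyset$ else
3   let trans = $\{(q, L, \tau, \phi) \in \delta \mid q \in r \text{ and } t(\pi) \in L\}$ in
4   let $r_i = \{q \mid \downarrow_i q \in \phi,\ \forall \phi \in \text{trans}\}$ in
5   let $\Gamma_1$ = eval_asta ($\mathcal{A}$, $t$, $\pi \cdot 1$, $r_1$)
6   and $\Gamma_2$ = eval_asta ($\mathcal{A}$, $t$, $\pi \cdot 2$, $r_2$)
7   in return eval_trans ($\Gamma_1$, $\Gamma_2$, $\pi$, trans)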

The function eval_asta evaluates an ASTA over an input tree $t$. It
returns a result set $\Gamma$, which is a mapping from states to the sets
of nodes selected in that state. In the usual non-selecting algorithm,
$\Gamma$ is simply the set of states which accept the current node $\pi$.

We have already described in detail how node selection works
for such automata in [1]; here we focus on the main novelty of this work,
relevant node approximation. The interested reader can refer to
Appendix C for the complete semantics of ASTAs (including node
selection) as well as a commented example. This process is abstracted
by the function eval_trans on Line 7, which handles both
selection and evaluation of formulas.
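
To make the recursion of Algorithm 4.1 concrete, the following is a minimal Python sketch of a non-jumping run, not the implementation described in the paper. It encodes the six transitions of Example 4.1 as tuples and uses a simplified, assumed semantics for `eval_trans`: a true formula collects the nodes carried by its true `down` atoms, a selecting transition adds the current node, and negation carries no nodes. The names `Node`, `DELTA`, `moves` and `holds` are hypothetical, and result sets are plain Python sets rather than the lists discussed later.

```python
from dataclasses import dataclass
from typing import Optional

# Formulas are nested tuples (an encoding assumed for this sketch):
# ("top",) ("bot",) ("or", f, g) ("and", f, g) ("not", f) ("down", i, q)
SIGMA = None  # stands for the whole alphabet

@dataclass(eq=False)  # identity-based equality, so nodes can be stored in sets
class Node:
    label: str
    first: Optional["Node"] = None  # first child  (move 1)
    next: Optional["Node"] = None   # next sibling (move 2)

def down(i, q):
    return ("down", i, q)

# Transitions (state, labels, selecting?, formula) for //a//b[c].
DELTA = [
    ("q0", {"a"}, False, down(1, "q1")),
    ("q0", SIGMA, False, ("or", down(1, "q0"), down(2, "q0"))),
    ("q1", {"b"}, True, down(1, "q2")),
    ("q1", SIGMA, False, ("or", down(1, "q1"), down(2, "q1"))),
    ("q2", {"c"}, False, ("top",)),
    ("q2", SIGMA, False, down(2, "q2")),
]

def moves(f, i):
    """States q such that the atom down_i q occurs in formula f."""
    if f[0] == "down":
        return {f[2]} if f[1] == i else set()
    if f[0] in ("or", "and"):
        return moves(f[1], i) | moves(f[2], i)
    if f[0] == "not":
        return moves(f[1], i)
    return set()

def holds(f, g1, g2):
    """Return (truth value, carried nodes) of f given the children's results."""
    kind = f[0]
    if kind == "top":
        return True, set()
    if kind == "bot":
        return False, set()
    if kind == "down":
        g = g1 if f[1] == 1 else g2
        return (f[2] in g), set(g.get(f[2], ()))
    if kind == "not":
        return (not holds(f[1], g1, g2)[0]), set()
    b1, n1 = holds(f[1], g1, g2)
    b2, n2 = holds(f[2], g1, g2)
    if kind == "and":
        return (True, n1 | n2) if (b1 and b2) else (False, set())
    return (b1 or b2), (n1 if b1 else set()) | (n2 if b2 else set())

def eval_asta(delta, node, r):
    if node is None:                                      # null tree '#'
        return {}
    trans = [t for t in delta                             # Line 3
             if t[0] in r and (t[1] is SIGMA or node.label in t[1])]
    r1 = set().union(*(moves(t[3], 1) for t in trans))    # Line 4
    r2 = set().union(*(moves(t[3], 2) for t in trans))
    g1 = eval_asta(delta, node.first, r1)                 # Line 5
    g2 = eval_asta(delta, node.next, r2)                  # Line 6
    result = {}                                           # Line 7 (eval_trans)
    for q, _, selecting, f in trans:
        ok, nodes = holds(f, g1, g2)
        if ok:
            mine = {node} if selecting else set()
            result.setdefault(q, set()).update(nodes | mine)
    return result
```

Calling `eval_asta(DELTA, root, {"q0"})` on a first-child/next-sibling encoded document returns a mapping whose `"q0"` entry, when present, holds the selected b nodes.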

The parameter $r$ of the function eval_asta allows one to restrict
bottom-up runs of $\mathcal{A}$ to only those which end up in a top state at
the root node. What this algorithm does is to first run a deterministic
top-down automaton $\mathcal{A}_{approx}$ during the recursive descent. This
automaton is a sound approximation of $\mathcal{A}$ in the sense that for any
$t \in T(\Sigma)$, $t \notin \mathcal{L}(\mathcal{A}_{approx}) \Rightarrow t \notin \mathcal{L}(\mathcal{A})$.
We can make further use of this automaton $\mathcal{A}_{approx}$ by only
jumping to a super-set of its relevant nodes.

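Definition 4.2 (Top-down approximation) Let $\mathcal{A} = (\Sigma, Q, \mathcal{T}, \delta)$
be an ASTA. The top-down approximation of $\mathcal{A}$ is the automaton

$tda(\mathcal{A}) = (\Sigma, 2^Q, \{\mathcal{T}\}, \delta_a)$ where

$\delta_a = \{(S, \sigma, \rightarrow, S_1, S_2) \mid S \subseteq Q,\ \sigma \in \Sigma,$
$\quad S_i = \{q \in Q \mid \exists q' \in S,\ \downarrow_i q \in \delta(q', \sigma)\}\}$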
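
As an illustration of how one step of this construction can be computed on demand, here is a small Python sketch under stated assumptions. Each transition of the //a//b[c] automaton is pre-digested into the sets of states occurring under a first-child and a second-child move; `MOVES` and `tda_step` are hypothetical names, and `None` stands for the full alphabet.

```python
SIGMA = None  # stands for the whole alphabet (assumption of this sketch)

# (state, labels, states under a down-1 move, states under a down-2 move)
MOVES = [
    ("q0", {"a"}, {"q1"}, set()),
    ("q0", SIGMA, {"q0"}, {"q0"}),
    ("q1", {"b"}, {"q2"}, set()),
    ("q1", SIGMA, {"q1"}, {"q1"}),
    ("q2", {"c"}, set(), set()),
    ("q2", SIGMA, set(), {"q2"}),
]

def tda_step(states, label, moves=MOVES):
    """Compute (S1, S2) of Definition 4.2 for one set of states and one label."""
    s1, s2 = set(), set()
    for q, labels, d1, d2 in moves:
        if q in states and (labels is SIGMA or label in labels):
            s1 |= d1
            s2 |= d2
    return frozenset(s1), frozenset(s2)
```

Because only the pairs (set of states, label) actually met during a run are ever passed to `tda_step`, the full powerset automaton is never built. With the transitions as encoded here, a literal reading of the definition keeps $q_2$ for the next sibling on label c; the sketch does not attempt any further refinement.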

The exponential blow-up exhibited by this construction is avoided
by computing the top-down approximation on-the-fly. The interesting
question is now which relevant nodes can be computed, and therefore
which jumps can be performed, if we consider the states in
$tda(\mathcal{A})$. Figure 1 illustrates the top-down approximation for the
automaton $\mathcal{A}_{//a//b[c]}$ as well as the jumps that can be computed
from its non-changing states. As we can see in the figure, the top-down
approximation allows us to jump quite precisely in the tree. If the
destination state for a subtree is $\{q_0\}$, the automaton can jump to the
top-most a node in the subtree. If the destination state is $\{q_0, q_1\}$,
the automaton can jump to a top-most b node in the subtree. If the
destination state is $\{q_0, q_1, q_2\}$, no jump is possible: the automaton
must perform a firstChild or nextSibling move. However, once
in state $\{q_0, q_1, q_2\}$, if the label is c then the automaton returns to
state $\{q_0, q_1\}$ and can therefore jump to find new b nodes.

Q01 /site/regions
Q02 /site/regions/europe/item/mailbox/mail/text/keyword
Q03 /site/closed_auctions/closed_auction/annotation/description/parlist/listitem
Q04 /site/regions/*/item
Q05 //listitem//keyword
Q06 /site/regions/*/item//keyword
Q07 /site/people/person[ address and (phone or homepage) ]
Q08 //listitem[ .//keyword and .//emph ]//parlist
Q09 /site/regions/*/item[ mailbox/mail/date ]/mailbox/mail
Q10 /site[ .//keyword ]
Q11 /site//keyword
Q12 /site[ .//keyword ]//keyword
Q13 /site[ .//keyword or .//keyword/emph ]//keyword
Q14 /site[ .//keyword//emph ]/descendant::keyword
Q15 /site[ .//*//* ]//keyword

Figure 2: Tree queries used in the experiments

4.4 Implementation Techniques 

Hybrid Evaluation The main drawback of the top-down approximation
of relevant nodes is that it forces a "top-down view" of the
query. For instance, for the query //a//b[c], if a document contains
many a-nodes and few b-nodes, the former will be needlessly
visited since they are part of the top-down approximation of the
relevant nodes. To alleviate this problem, we propose an alternative
evaluation strategy dubbed hybrid evaluation. The idea is to
start anywhere in the query and the document. In the case of the query
//a//b[c], this means starting evaluation at all b-nodes in the document
and checking, in a recursive top-down+bottom-up fashion, the
filter "[c]" in their subtrees and the path "//a" in their upward context.
Such a strategy can be effective if the count of b-nodes is low.

Memoization If we consider Algorithm 4.1, we see that the computations
performed at Line 3 (and 7) have complexity $O(|\delta|)$. They
contribute the $|Q|$ factor to the complexity $O(|\delta| \cdot |D|)$ of the
evaluation function. We can memoize these computations, which only
depend on $r$ and $t(\pi)$ for Line 3, and on $r$, $t(\pi)$, $\Gamma_1$ and
$\Gamma_2$ for Line 7. This technique amortises the $|Q|$ factor over the
whole run: except for a few "warm-up" nodes for which all the transitions
must be scanned, the rest of the run consists of a succession of look-ups in
a table, one for each node visited during the run.
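
The look-up table for Line 3 can be sketched in a few lines of Python. This is only an illustration under assumptions: transitions are reduced to (state, labels, formula) triples with the formula kept as an opaque string, `None` stands for the full alphabet, and `make_lookup` is a hypothetical helper name.

```python
def make_lookup(delta):
    """Return a memoized version of the Line 3 computation and its table."""
    table = {}

    def lookup(r, label):
        # The applicable transitions depend only on the set of states r
        # and on the label of the current node, so that pair is the key.
        key = (frozenset(r), label)
        if key not in table:
            # "Warm-up" case: scan every transition once for this key.
            table[key] = tuple(
                t for t in delta
                if t[0] in r and (t[1] is None or label in t[1])
            )
        return table[key]

    return lookup, table

# Hypothetical transitions (state, labels, formula-as-text).
DELTA = [
    ("q0", {"a"}, "down1 q1"),
    ("q0", None, "down1 q0 or down2 q0"),
    ("q1", {"b"}, "down1 q2"),
]
lookup, table = make_lookup(DELTA)
```

After the first call for a given pair, later calls with the same states and label are answered from the dictionary, and the size of `table` corresponds to the number of nodes that paid the full scan.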

Information Propagation During the traversal, a node is "seen"
three times by the evaluation function: (i) when reaching the node
during the top-down traversal, (ii) when returning from the evaluation
of the first child, and (iii) when returning from the evaluation of
the second child. Instead of waiting for (iii) to evaluate the transitions,
we can already evaluate them in (ii), having only the knowledge
for the first child. This reduces the number of states to verify while
visiting the second child. In particular, it ensures that for an XPath
predicate, only one witness is checked by the automaton, the first
one in pre-order (existential semantics). This is inspired by the
evaluation of Non-Uniform Automata of [5].

Result Sets Since the nodes are traversed in document order and
only once, result sets can be implemented as simple lists with
constant-time concatenation for the union of two result sets.

5. EXPERIMENTS 

We use several experiments to illustrate the behaviour of the
algorithms we introduced and to gauge precisely the impact of each
of the optimizations and implementation techniques we presented.
Due to space constraints, we do not try to give in this paper the bare
performance of our implementation. The interested reader can refer
to [1], where a large experimental section compares our implementation
to state-of-the-art query engines (MonetDB/XQuery and Qizx/DB) for a
richer set of queries (both tree oriented and text oriented).
Nevertheless, we provide for the sake of completeness
a comparison of our implementation with the MonetDB/XQuery
engine in Appendix D.

Figure 4: Impact of the jumping and memoization on query
evaluation time

Implementation Our implementation¹ features a bottom-up with
top-down pre-processing evaluation function ("top-down+bottom-up"
as we refer to it in the rest of the section) which uses the jumping
primitives described in [1]. These indexes support jumping to
the first descendant and following nodes whose label is in a set $L$ in
time $O(|L|)$. As for the hybrid evaluation function, due to the lack
of upward-jumping functions in this index, it performs its upward
part using only parent moves (instead of jumping to ancestors with
particular labels). It nevertheless remains an effective strategy when
one of the labels in the query has a low count (our index provides
the global count of a label in constant time).

Documents and Queries We used the XMark [9] document generator
for our tests. We report our results for a document of size
116MB. The tree oriented queries we used are given in Figure 2.
Q01 to Q09 are realistic queries for XMark documents, taken from
the XPathMark benchmark [4]. Q10 to Q15 allow us to illustrate in
more detail the behaviour of our ASTAs.

Impact of Jumping and Memoization We report in Figure 4 the
query answering time of our engine for each query (note the logarithmic
scale for the times). The "Naive Eval." series represents a
straightforward execution of Algorithm 4.1. As we can see, a naive
evaluation where the $|Q|$ factor has to be paid for each node, and
which potentially visits every node in $D$, is not satisfactory. For
queries where a "//" occurs at top-level, the full document needs
to be traversed, yielding an evaluation time from 1s to 10s. The
"Jumping Eval." series represents a run where the evaluation function
computes the top-down approximation of relevant nodes on-the-fly
and jumps only to these nodes. No memoization occurs,
therefore the $|Q|$ factor is paid for each visited node. As expected,
this is a huge improvement compared to the naive case. With this
optimization alone, all the tested queries require less than 150ms to
evaluate, a ten- to hundred-fold improvement. The "Memo. Eval."
series represents runs where on-the-fly computations are
memoized. For these runs, the $|D|$ factor is paid in full (unless
the automaton can skip whole subtrees as in Q01) while the $|Q|$
factor is amortized. This technique also improves query answering
time considerably: a full traversal takes no more than 450ms.
The fact that only firstChild and nextSibling moves are used also
demonstrates that alternating automata are a framework of choice,
even over pointer-based data-structures. Lastly, the "Opt. Eval." series
represents runs where both optimizations are enabled. We can
see that they are complementary: with the exception of Q01 and
Q12, the "Opt. Eval." time is always better (at least twice as fast)
than either optimization taken individually. Q01 and Q12 are a very
particular case where the query only touches two nodes, therefore
the transitions memoized in the look-up table are never re-used and
their insertions only constitute an overhead.

¹ Our implementation is written in OCaml (for the ASTA/XPath query
part) and C++ (for the indexes). Our test machine is described in
Appendix D.

[Figure 5: table not reproduced. Rows: (1) number of selected nodes;
number of nodes visited by (2) a hybrid run and (3) a
top-down+bottom-up run. Columns: configurations A to D.]

A : 75021 listitem, 3 keyword below listitems (3 in total) and
    4 emphs below those 3 keywords;
B : 75021 listitem, 60234 keyword below listitems (60234 in
    total) and 4 emphs below those keywords;
C : 9083 listitem, one keyword below listitems (40493 in total)
    and 65831 emphs below one of the keyword below a listitem;
D : 20304 listitem, 10209 keyword below one listitem (10209
    in total) and 15074 emphs below one of those keyword.

Figure 5: Selected and visited nodes for the hybrid and top-down
evaluation procedures, for query //listitem//keyword//emph

Top-Down Relevance Approximation, Automata Logic and Memoization
The table in Figure 3 gives the number of selected nodes
(Line (1)). These numbers are to be contrasted with Line (2), which
represents the number of nodes visited by a jumping function (that
is, the size of the approximated set of relevant nodes). For realistic
queries (Q01-Q09 with the exception of Q08), the number of
selected nodes is more than 10% of the number of visited nodes
(this ratio is given at Line (5)). Of particular interest is Q05. For
such a query, and while the automaton is given in an alternating
and non-deterministic way, we end up touching exactly the number
of relevant nodes (the top-most listitems and the keywords
below them). This number can be contrasted with the total number
of nodes (more than 5 million), most of which are completely
ignored by the evaluation function.

Line (3) shows also that for a non-jumping algorithm, our evalu- 
ation function skips, when possible, a large number of subtrees. Of 
course it is necessary to traverse the whole document as soon as a 
top-level "//" is present. 

The automata logic is better highlighted by looking at Line (2)
for queries Q10 to Q15. Here, it is clear that predicates are efficiently
checked. For Q11, Q12 and Q13, the predicate check is
done together with the accumulation of keyword nodes, and no
extra relevant node is touched. For queries Q14 and Q15, only a
small number of nodes (1 and 2 respectively) are touched in order
to satisfy the predicate. Of course, the predicate need not be applied
to the root node: such optimizations are performed for any kind
of condition, regardless of its position in the query (it is simply easier
to illustrate them on the single root element).

Lastly, Line (4) represents the number of entries added to the
memoization table, or equivalently the number of nodes for which
the evaluation function paid a $|Q|$ factor (whereas all the others
consisted of a constant-time look-up). For practical queries, the
size of such tables is very small and the speed-up they generate is
worth the small memory overhead (a few kilobytes at most).

[Figure 3: table not reproduced. Rows: (1) number of selected nodes;
(2) number of visited nodes with jumping; (3) number of visited nodes
without jumping; (4) number of memoized transitions; (5) ratio of
selected nodes vs. approximated top-down relevant nodes (in %).
Number of nodes in the document = 5673051.]

Figure 3: Number of selected and visited nodes (with and without
jumping), and number of memoized configurations

Hybrid Traversal Figure 5 describes the behaviour of the hybrid
evaluation function for four particular configurations of XMark
documents that we manually created.

We consider the query //listitem//keyword//emph and change
the proportion and placement of the listitem, keyword and
emph elements. For each such configuration (A to D), we report
the query evaluation time for a hybrid run and for a regular
top-down+bottom-up run. We also report the number of nodes
selected by the query and the number of nodes visited by both
strategies. Configurations A and B represent the best cases for the
hybrid traversal: one of the labels in the query has a very low global
count. In A, the count of keyword nodes is small; the evaluation
starts at these nodes, checks in a pure bottom-up fashion that
they have a listitem ancestor and collects their emph descendants.
For configuration B, the hybrid run actually performs a pure
bottom-up run of the query, starting at emph nodes. Both visit very
few nodes compared to the relevant nodes approximated by the
top-down+bottom-up evaluation (Line (3)). Configuration C represents
a case where the hybrid behaves like the top-down+bottom-up run,
since the global count of keyword elements is low. Lastly,
Configuration D is the worst-case scenario, where keyword has the
lowest global count, but one which is close to the number of listitem
elements. Even though the top-down+bottom-up run visits more nodes,
it is twice as fast thanks to its use of jumping primitives. While this
particular experiment seems artificial, configurations A and B
actually simulate the behaviour of text-oriented queries, where the
text predicate is often very selective. Such queries were investigated
in [1], where the same hybrid procedure yields significant
improvement over state-of-the-art text-aware XPath engines.
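
The choice of a starting point for the hybrid run can be pictured with a short Python sketch. It rests on two assumptions that the text does not state as such: that the run starts at the query label with the smallest global count, and that the emph figures listed for the configurations can be read as global counts. `CONFIGS` and `start_label` are hypothetical names, and only configurations A, B and D are included.

```python
# Label counts for three of the configurations listed with Figure 5
# (reading the emph figures as global counts is an assumption).
CONFIGS = {
    "A": {"listitem": 75021, "keyword": 3, "emph": 4},
    "B": {"listitem": 75021, "keyword": 60234, "emph": 4},
    "D": {"listitem": 20304, "keyword": 10209, "emph": 15074},
}

def start_label(query_labels, count):
    """Pick the query label with the lowest global count."""
    return min(query_labels, key=lambda label: count[label])

QUERY = ["listitem", "keyword", "emph"]  # //listitem//keyword//emph
starts = {name: start_label(QUERY, counts) for name, counts in CONFIGS.items()}
```

Under these assumptions the sketch starts at keyword nodes for A and D and at emph nodes for B, which is in line with the behaviour described for those configurations. A real engine would also weigh the cost of the upward part, which here relies on parent moves only.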

6. CONCLUSION 

We have presented an effective way to reduce the number of
nodes traversed during the evaluation of a navigational XPath query,
using the novel notion of relevant nodes for an automaton. We have
shown that this notion, coupled with a wide range of implementation
techniques, makes alternating selecting tree automata a compilation
target of choice for XPath queries, yielding execution speed
on par with the best XPath engines available. While we have only
focused our presentation on forward Core XPath, our prototype
actually implements backward axes (by adding "up-moves" to formulas
of the ASTA, which are rewritten into down moves on-the-fly)
and XPath 1.0 functions. Unfortunately "up-moves" are not part of
the theory and present two problems. The first one is that we do
not yet have a sound approximation of relevant nodes in the presence
of up-moves (therefore we cannot jump). The second, more
troublesome one is that in the presence of up-moves, a single
top-down followed by a bottom-up pass is not sufficient in general:
one needs an extra top-down pass (as observed by Koch in [10])
or more book-keeping operations in the result sets. XPath
1.0 functions are naively treated as black-boxes which are called
during formula evaluation. This defeats some of the automata
optimizations, since a query "//a[ count(.//b) ]//c" gets compiled into
three separate automata.

As future work, we plan to generalize the top-down approximation
to backward axes (it seems possible since ASTAs are known
not to gain any expressive power with the addition of up-moves), and
to extend the work in [1] to efficiently handle not only text predicates
but also numeral predicates, context-dependent functions (e.g.
"position()") and data joins.
APPENDIX

A. SELECTING TREE AUTOMATA 

We consider an example of a BDSTA for which there is no equivalent
top-down deterministic STA.

Example A.1 Let $\mathcal{A}_{//a[.//b]} = (\Sigma, Q, \mathcal{T}, \mathcal{B}, \mathcal{S}, \delta)$
where $\Sigma = \{a, b, c\}$, $Q = \{q_0, q_1\}$, $\mathcal{T} = \{q_0, q_1\}$,
$\mathcal{B} = \{q_0\}$, $\mathcal{S} = \{(q_1, a)\}$, and $\delta$
consists of the following eight transitions. A transition
$(q, x, q', q'') \in \delta$ is now written in the form
$q \leftarrow x, (q', q'')$. Let _ denote any state in $\{q_0, q_1\}$.

$q_1 \leftarrow \{b\}, (q_0, \_)$          $q_1 \leftarrow \{b\}, (q_1, \_)$
$q_0 \leftarrow \Sigma \setminus \{b\}, (q_0, \_)$          $q_1 \leftarrow \Sigma \setminus \{b\}, (q_1, \_)$

The automaton $\mathcal{A}_{//a[.//b]}$ accepts the set of all trees:
$\mathcal{L}(\mathcal{A}_{//a[.//b]}) = T(\Sigma)$. Moreover,
$\mathcal{A}_{//a[.//b]}$ is a bottom-up complete BDSTA. It selects all
the a-nodes that have a b-node in their left subtree. In
terms of XML, this automaton realizes the XPath query //a[.//b].
We claim that there is no top-down deterministic STA equivalent to
$\mathcal{A}_{//a[.//b]}$ of Example 2.1. Intuitively, the top-down
automaton does not know whether or not to select an a-node, because this
depends on the left subtree of that node, which has not yet been processed
by the automaton.

Definition A.1 (Reachable state) If a state $q'$ appears in the right-hand
side of a rule with $q$ in its left-hand side, then we say that
$q$ one-step reaches $q'$, denoted by $q \rightarrow_A q'$. We denote by
$\rightarrow^*_A$ the reflexive transitive closure of $\rightarrow_A$, and
say that $q$ reaches $q'$ if $q \rightarrow^*_A q'$.

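We give the formal definition for the notation $A[q]$ of the restriction
of an automaton to a state.

Definition A.2 (Restriction of an automaton to a set of states) Let
$A = (\Sigma, Q, \mathcal{T}, \mathcal{B}, \mathcal{S}, \delta)$ and
$\{q_1, \ldots, q_n\} \subseteq Q$. The restriction of
$A$ to $\{q_1, \ldots, q_n\}$ is the STA

$A[q_1, \ldots, q_n] = (\Sigma, Q', \mathcal{T}', \mathcal{B}', \mathcal{S}', \delta')$

where $\mathcal{T}' = \{q_1, \ldots, q_n\}$, $Q'$ is the set of the states
reachable from $\mathcal{T}'$, i.e.,
$Q' = \{q' \in Q \mid \exists q \in \mathcal{T}', q \rightarrow^*_A q'\}$, and
$\mathcal{B}'$, $\mathcal{S}'$, and $\delta'$ are the restrictions to the
states in $Q'$ of $\mathcal{B}$, $\mathcal{S}$, and $\delta$,
i.e., $\mathcal{B}' = \mathcal{B} \cap Q'$,
$\mathcal{S}' = \{(q, l) \in \mathcal{S} \mid q \in Q'\}$, and
$\delta' = \{(q, L, q_1, q_2) \in \delta \mid q \in Q'\}$.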
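
The two definitions above translate directly into a worklist computation. The following Python sketch is an illustration only: transitions are assumed to be tuples `(q, labels, q1, q2)`, selecting configurations are pairs `(q, label)`, and the function names `reachable` and `restrict` are hypothetical.

```python
def reachable(delta, start):
    """States reachable from `start` (reflexive transitive closure)."""
    seen = set(start)
    stack = list(start)
    while stack:
        q = stack.pop()
        for p, _labels, q1, q2 in delta:
            if p == q:
                for target in (q1, q2):
                    if target not in seen:
                        seen.add(target)
                        stack.append(target)
    return seen

def restrict(delta, selecting, bottom, start):
    """Restriction of an STA to the states in `start`, as (Q', T', B', S', delta')."""
    qs = reachable(delta, start)
    return (
        qs,
        set(start),
        set(bottom) & qs,
        {(q, l) for (q, l) in selecting if q in qs},
        [t for t in delta if t[0] in qs],
    )
```

The alphabet is left out of the returned tuple since the restriction does not change it.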

A.l Relating STAs to Ordinary Tree Automata 

In the next section we will characterize, for a given tree
$t \in T(\Sigma)$, the nodes of $t$ that are "relevant" for the STA $A$.
Intuitively, a node is relevant if $A$ changes its state at that node.
However, it can be that the STA $A$ is "badly programmed" and changes its
state at more places than is actually necessary for the query. We therefore
want to consider the minimal automaton $A'$ that is equivalent to
$A$, where minimal means with the least number of states. It is
well-known that for every ordinary deterministic tree automaton
(TA) there is an equivalent unique minimal one, and that it can be
computed in quadratic time. Instead of inventing and proving a new
minimization procedure for STAs, we prefer to encode them into
ordinary tree automata in such a way that the encoding allows us to
obtain a minimal STA from the minimal encoded automaton. Thus,
we reduce minimization for STAs to minimization for ordinary tree
automata.

We require that the STA $A$ is either top-down or bottom-up complete.
To encode an STA into a TA, we simply encode the selection
of a node through special labels. We define the alphabet
$\hat{\Sigma} = \{\hat{\sigma} \mid \sigma \in \Sigma\}$. Now, if $A$ selects
a node in a given tree (with label $l$), then the TA $\hat{A}$ associated
to $A$ accepts a tree that has the label $\hat{l}$ at that node. Formally,

$\hat{A} = (\Sigma \cup \hat{\Sigma}, Q, \mathcal{T}, \mathcal{B}, \emptyset, \hat{\delta})$

where $\hat{\delta}$ is defined as follows. Every transition
$(q, L, q_1, q_2) \in \delta$ such that there exists an $l \in L$ with
$(q, l) \in \mathcal{S}$ is changed into the new transition
$(q, L', q_1, q_2)$ of $\hat{A}$ where
$L' = \{l \in L \mid (q, l) \notin \mathcal{S}\}$ (if $L' = \emptyset$ then
the transition is removed), and additionally we add the new transition
$(q, \hat{L}, q_1, q_2)$ to $\hat{\delta}$ where
$\hat{L} = \{\hat{l} \mid l \in L, (q, l) \in \mathcal{S}\}$. Finally, we make
the automaton obtained so far complete: for every $q \in Q$ let
$L(q) = \{\sigma \in \Sigma \cup \hat{\Sigma} \mid \hat{\delta}(q, \sigma) = \emptyset\}$
and, if $L(q) \neq \emptyset$, then add the transition
$(q, L(q), q_\bot, q_\bot)$ to $\hat{\delta}$. For the new sink state
$q_\bot$ we add the transition
$(q_\bot, \Sigma \cup \hat{\Sigma}, q_\bot, q_\bot)$ to $\hat{\delta}$.
It should be clear that

(1) for every $t \in \mathcal{L}(A)$ there exists a tree
    $t' \in \mathcal{L}(\hat{A})$ which is obtained from $t$ by changing
    the label of every $\pi \in A(t)$ into $\hat{l}$, where $l = t(\pi)$;

(2) for every $t' \in \mathcal{L}(\hat{A})$ there exists a tree
    $t \in \mathcal{L}(A)$ obtained by removing all hats, and every node
    $\pi$ in $t'$ that has a hat is in $A(t)$.

If (1) and (2) hold for two automata $A$ and $\hat{A}$ then we say that they
are equivalent, denoted by $A \equiv \hat{A}$.
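
The label-splitting part of this encoding can be sketched in Python as follows. This is a simplified illustration that omits the completion step with the sink state; hatted labels are represented by appending `"^"` to the label, transitions are assumed to be tuples `(q, labels, q1, q2)`, and `hat` and `encode` are hypothetical names.

```python
def hat(label):
    """Representation of a hatted label (a convention of this sketch)."""
    return label + "^"

def encode(delta, selecting):
    """Split each transition into a plain part and a hatted (selecting) part."""
    encoded = []
    for q, labels, q1, q2 in delta:
        selected = {l for l in labels if (q, l) in selecting}
        plain = set(labels) - selected
        if plain:        # labels on which q does not select
            encoded.append((q, frozenset(plain), q1, q2))
        if selected:     # labels on which q selects, now carrying a hat
            encoded.append((q, frozenset(hat(l) for l in selected), q1, q2))
    return encoded
```

A transition none of whose labels is selecting is copied unchanged, and a transition all of whose labels are selecting is replaced by its hatted version only.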

Example A.2 The recognizer associated with the STA defined in
Example 2.1 is:

^ = (E U S, {go, gi, gl}, {go}, {g'o, g'l}, 0, S) 

where 5 is defined as: 

go, {a} ->(g"i, go) gl, {b} U S \ {b}->(g''i, gl) 
go, S \ {a}->(g-o, g'o) g"i, {b} U_E \ {b}^(gl, gl) 
go,E ->(gl,gl) gl,EuE ^(gl.gl) 

The connection between an STA and its associated recognizer is 
quite strong, as we state in the following lemma. 

Lemma A.1 Let $A$ and $A'$ be two STAs, defined over the same
alphabet $\Sigma$. Then $A \equiv A'$ if and only if
$\mathcal{L}(\hat{A}) = \mathcal{L}(\hat{A}')$.

We have seen how to translate an STA into an ordinary tree automaton.
It should be clear that this translation preserves determinism.
The translation is invertible: for any automaton $A$, one
can build an equivalent (in the sense of Lemma A.1) ordinary tree
automaton $\hat{A}$. However, this inverse translation does not preserve
determinism. Indeed, while both formalisms are equally expressive,
they do not have the same behaviour. The automaton $\hat{A}$ only
needs to verify that a tree in $T(\Sigma \cup \hat{\Sigma})$ is in its
language. This can always be done in a bottom-up deterministic way (it is
folklore that bottom-up tree automata can be determinized, see [3]).

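For our purpose, it is enough to observe that if a deterministic
automaton $\hat{A}$ is "selecting-unambiguous", then it can be
transformed into a deterministic STA. Formally, the tree automaton
$\hat{A} = (\Sigma \cup \hat{\Sigma}, Q, \mathcal{T}, \mathcal{B}, \emptyset, \delta)$
is selecting-unambiguous if and only if for
every $q \in Q$, and for every $t \in \mathcal{L}(\hat{A}[q])$:

• if $t(\epsilon) = \sigma \in \Sigma$, then $t[\epsilon \mapsto \hat{\sigma}] \notin \mathcal{L}(\hat{A}[q])$

• if $t(\epsilon) = \hat{\sigma} \in \hat{\Sigma}$, then $t[\epsilon \mapsto \sigma] \notin \mathcal{L}(\hat{A}[q])$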

Lemma A.2 Let $A$ be a complete TA. Then $A$ is selecting-unambiguous.

Lemma A.3 Let $A'$ be a complete selecting-unambiguous TDTA
(resp. BDTA). There effectively exists a complete TDSTA (resp.
BDSTA) $A$ such that $A \equiv A'$.

Proof. (sketch) The proof builds the automaton $A$ as follows.
For each transition $(q, L, q_1, q_2) \in \delta'$, we split the transition in
two, $(q, L', q_1, q_2) \in \delta'$ and $(q, L'', q_1, q_2) \in \delta'$, where
$L' = L \cap \Sigma$ and $L'' = L \cap \hat{\Sigma}$ (if $L'$ or $L''$ is
empty, we just skip it). Since $A'$ is selecting-unambiguous, if
$\sigma \in L'$, then $\hat{\sigma} \notin L''$ (and vice versa). If neither
$q_1$ nor $q_2$ is a sink state, then we add
$(q, L', q_1, q_2)$ as a transition to $\delta$ and, if
$L'' = \{\hat{\sigma}_1, \ldots, \hat{\sigma}_k\}$,
we add $(q, \{\sigma_1, \ldots, \sigma_k\}, q_1, q_2)$ to $\delta$ and
$(q, \sigma_i)$ to $\mathcal{S}$. Once this
is done for all transitions, we remove all unreachable states and we
obtain $A$.

Now that we have established a precise correspondence between 
STAs and TAs we get for free some properties of TAs, such as 
minimization. 

A.2 Minimization 

As mentioned before, minimal here means with the smallest number
of states. Given a BDTA $A = (\Sigma, Q, \mathcal{T}, \mathcal{B}, \delta)$,
the standard algorithm for minimization (see, e.g., [3]) builds the set of
equivalence classes for every state in $Q$. Two states $q$ and $q'$ are in
the same equivalence class if and only if
$\mathcal{L}(A[q]) = \mathcal{L}(A[q'])$.
The algorithm initializes the set of equivalence classes with
$E_0 = \{Q \setminus \mathcal{T}, \mathcal{T}\}$. The intuition is that final
and non-final states are not in the same equivalence class (indeed, if $q$
is a final state and $q'$ is not a final state, then $A[q]$ accepts the null
tree # while $A[q']$ does not, hence
$\mathcal{L}(A[q]) \neq \mathcal{L}(A[q'])$). The algorithm then proceeds to
refine the equivalence relation. We write $q\ E_n\ q'$ for the fact that $q$
and $q'$ are equivalent in the equivalence relation $E_n$, that is, there
exists $S \in E_n$ such that $q \in S$ and $q' \in S$. From $E_n$ the
algorithm computes a finer equivalence relation $E_{n+1}$ such that
$q\ E_{n+1}\ q'$ if:

• $q\ E_n\ q'$;

• $\forall \sigma \in \Sigma, \forall q_1, q_2 \in Q$,
  $\delta(q_1, q, \sigma) = \delta(q_1, q', \sigma)$ and
  $\delta(q, q_2, \sigma) = \delta(q', q_2, \sigma)$.

The procedure stops when $E_n = E_{n+1}$. The case of TDTAs is
similar.
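
For readers who prefer code, the refinement loop can be sketched in Python. This is a generic partition-refinement sketch for a complete bottom-up deterministic automaton over binary trees, not the procedure of any particular library; it compares successor states up to the current classes, as the standard algorithm does. The transition function is assumed to be a dictionary mapping `(q1, q2, label)` to a state, and `minimize_classes` is a hypothetical name.

```python
def minimize_classes(states, alphabet, delta, final):
    """Map each state to the identifier of its equivalence class."""
    order = sorted(states)
    labels = sorted(alphabet)
    # Initial relation E_0: final states versus non-final states.
    cls = {q: int(q in final) for q in order}
    while True:
        ids = {}
        new = {}
        for q in order:
            # Signature of q: its current class plus the classes reached
            # when q is used as second child and as first child.
            sig = (
                cls[q],
                tuple(cls[delta[(p, q, a)]] for p in order for a in labels),
                tuple(cls[delta[(q, p, a)]] for p in order for a in labels),
            )
            new[q] = ids.setdefault(sig, len(ids))
        # Refinement only splits classes, so an unchanged count means stability.
        if len(ids) == len(set(cls.values())):
            return new
        cls = new
```

States that receive the same identifier can be merged; the number of distinct identifiers is the number of states of the minimal automaton.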

Of course we would like, given a selecting automaton $A$, to compute
its associated recognizer $\hat{A}$, minimize it using the standard
procedure and translate it back into a selecting automaton. However,
as we have seen, translating a recognizer into a selecting automaton
does not always preserve determinism. Fortunately, we can show
that the property of selecting-unambiguousness is preserved by the
minimization procedure.

Lemma A.4 Let $A$ be a complete TDTA (resp. BDTA) over the
alphabet $\Sigma \cup \hat{\Sigma}$. Let $A_{min}$ be the minimal automaton
such that $\mathcal{L}(A_{min}) = \mathcal{L}(A)$. If $A$ is
selecting-unambiguous, then so is $A_{min}$.

Proof. First let us remark that since $A$ is selecting-unambiguous,
then $\forall q \in Q, \forall t \in \mathcal{L}(A[q])$, if
$t(\epsilon) = \sigma \in \Sigma$ then
$t[\epsilon \mapsto \hat{\sigma}] \notin \mathcal{L}(A[q])$, and if
$t(\epsilon) = \hat{\sigma} \in \hat{\Sigma}$ then
$t[\epsilon \mapsto \sigma] \notin \mathcal{L}(A[q])$.

Now suppose that there are two states $q_1, q_2 \in Q$ such that
$\exists\, \sigma(t_1, t_2) \in \mathcal{L}(A[q_1])$ and
$\exists\, \hat{\sigma}(t_1, t_2) \in \mathcal{L}(A[q_2])$. $A_{min}$ is
selecting-unambiguous if and only if $q_1$ and $q_2$ are not in the same
equivalence class (if they were, then there would be a state
$q' \in Q_{min}$ for which the selecting-unambiguous property does not hold,
namely the state representing the equivalence class of $q_1$ and $q_2$). We
must therefore show that $\mathcal{L}(A[q_1]) \neq \mathcal{L}(A[q_2])$. This
is immediate: since $A$ is selecting-unambiguous, and since
$\sigma(t_1, t_2) \in \mathcal{L}(A[q_1])$, then
$\hat{\sigma}(t_1, t_2) \notin \mathcal{L}(A[q_1])$. However
$\hat{\sigma}(t_1, t_2) \in \mathcal{L}(A[q_2])$ and therefore
$\mathcal{L}(A[q_1]) \neq \mathcal{L}(A[q_2])$.

Using this lemma, we can state the existence of a minimal se- 
lecting tree automaton. 



Theorem A.1 Let $A$ be a complete TDSTA (resp. BDSTA). There
effectively exists a complete TDSTA (resp. BDSTA) $A_{min}$ which is
equivalent to $A$, and no other equivalent TDSTA (resp. BDSTA) has
fewer states than $A_{min}$.

Theorem A.1 states the existence of a minimal selecting automaton
and also gives a way to compute it. Indeed, it is sufficient to
translate a selecting automaton into a recognizer, minimize the latter
and transform it back into a selecting automaton. However, the
proof of Lemma A.4 points us toward a more direct method. Indeed,
in a recognizer, if a state $q_1$ accepts some tree $\sigma(t_1, t_2)$ and a
state $q_2$ accepts the tree $\hat{\sigma}(t_1, t_2)$, then $q_1$ and $q_2$
are in different equivalence classes. In the transformation from recognizer
to selecting automaton, $(q_2, \sigma)$ becomes a selecting configuration.
Therefore, if two states $q_1$ and $q_2$ are such that
$(q_1, \sigma) \notin \mathcal{S}$ and $(q_2, \sigma) \in \mathcal{S}$, then
these two states are not in the same equivalence class. Minimizing
a selecting automaton can therefore be achieved by using the
standard algorithm, but where the initial relation $E_0$ is:

$E_0 = \{\ \{q \in Q \mid q \in \mathcal{F}, q \in \mathcal{S}\},$
$\quad\quad\ \ \{q \in Q \mid q \in \mathcal{F}, q \notin \mathcal{S}\},$
$\quad\quad\ \ \{q \in Q \mid q \notin \mathcal{F}, q \in \mathcal{S}\},$
$\quad\quad\ \ \{q \in Q \mid q \notin \mathcal{F}, q \notin \mathcal{S}\}\ \}$

Here $\mathcal{F}$ stands for the set of final states, that is
$\mathcal{T}$ for BDTAs and $\mathcal{B}$ for TDTAs.

B. RELEVANCE 

B.l Top-Down Relevance 

Algorithm B.1 (Top-down traversal with jumping)
Input: Minimal TDTA $\mathcal{A} = (\Sigma, Q, \mathcal{T}, \mathcal{B}, \mathcal{S}, \delta)$ and a tree $t$
Output: (possibly empty) mapping from nodes of $t$ to states of $\mathcal{A}$.

let following ($\pi$, $L$, $\pi_0$) =
  if $\pi = \Omega$ then return $\emptyset$
  else return $\{\pi\} \cup$ following ($f_t(\pi, L, \pi_0)$, $L$, $\pi_0$);

let relevant_nodes ($t$, $\pi$, $q$) =
  if $\exists L \subseteq \Sigma$, $(q, L, q, q) \in \delta$ and $\lnot$is_marking($q$)
  then { $L' := \Sigma \setminus L$;
    if $t(\pi) \in L'$ then return $\{\pi\}$;
    $\pi' := d_t(\pi, L')$;
    return $\{\pi'\} \cup$ following ($\pi'$, $L'$, $\pi$)
  } else
  if $\exists L \subseteq \Sigma$, $(q, L, q, q_\top) \in \delta$
     and is_universal($q_\top$) and $\lnot$is_marking($q$)
  then { $L' := \Sigma \setminus L$;
    if $t(\pi) \in L'$ then return $\{\pi\}$;
    $\pi' := l_t(\pi, L')$;
    if $\pi' = \Omega$ then return $\emptyset$ else return $\{\pi'\}$
  } else
  if $\exists L \subseteq \Sigma$, $(q, L, q_\top, q) \in \delta$
     and is_universal($q_\top$) and $\lnot$is_marking($q$)
  then { $L' := \Sigma \setminus L$;
    if $t(\pi) \in L'$ then return $\{\pi\}$;
    $\pi' := l_t(\pi, L')$;
    if $\pi' = \Omega$ then return $\emptyset$ else return $\{\pi'\}$
  } else
    return $\{\pi\}$;

let topdown_jump_rec ($\pi$, $q$) =
  $l := t(\pi)$;
  if $l = \#$ then
    if $q \in \mathcal{B}$ then return $\{\pi \mapsto q\}$
    else throw Failure
  else {
    $(q_1, q_2) := \delta(q, l)$;
    if is_sink($q_1$) or is_sink($q_2$) then throw Failure;
    lnodes := relevant_nodes ($t$, $\pi \cdot 1$, $q_1$);
    rnodes := relevant_nodes ($t$, $\pi \cdot 2$, $q_2$);
    return $\{\pi \mapsto q\} \cup \bigcup_{\pi_1 \in \text{lnodes}}$ topdown_jump_rec ($\pi_1$, $q_1$)
           $\cup \bigcup_{\pi_2 \in \text{rnodes}}$ topdown_jump_rec ($\pi_2$, $q_2$);
  }

let topdown_jump ($t$, $(\Sigma, Q, \mathcal{T}, \{q\}, \mathcal{S}, \delta)$) =
  try {
    nodes := relevant_nodes ($t$, $\epsilon$, $q$);
    return $\bigcup_{\pi \in \text{nodes}}$ topdown_jump_rec ($\pi$, $q$);
  } catch (Failure) { return $\emptyset$ }

B.2 Bottom-Up Relevance 

Algorithm B.2 (Bottom-up evaluation) 

Input : A BDTA A = {T:,Q,T, {qo},S, S} a tree t a sequence 

So = (to, go), (7ri,go), (Tn.go) 

where the tt; are the leaves of t in pre-order. 
Output : A mapping from nodes of t to states of A 

let bottom_up_rec(S, t, R) =
  switch S {
    case (π, q):
      if π = ε and q ∈ 𝒯
      then return {ε ↦ q} ∪ R, ();
      else throw Failure;

    case (π₁, q₁), (π₂, q₂), S′:
      if siblings(π₁, π₂)
      then {
        π := parent π₁;
        {q} := δ(q₁, q₂, t(π));
        return bottom_up_rec(((π, q), S′), t, {π ↦ q} ∪ R)
      } else {
        R′, S″ := bottom_up_rec(((π₂, q₂), S′), t, R);
        return bottom_up_rec(((π₁, q₁), S″), t, R′);
      }

    case ():
      return R, ();
  }

let bottom_up(t, S₀, A) =
  try {
    R, _ := bottom_up_rec(S₀, t, {π ↦ q | (π, q) ∈ S₀});
    return R;
  } catch (Failure)
    return ∅;
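As a rough illustration of what a bottom-up run computes, the following Python sketch evaluates a deterministic bottom-up automaton by plain recursion over a binary tree. It is not the sequence-based procedure of Algorithm B.2: the tuple encoding of trees, the path strings built from "1" and "2", the `Failure` class and the sample transition function `contains_b` are all assumptions made for this example, and null leaves are not recorded in the returned mapping.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# A binary tree is (label, left, right); None stands for the null leaf '#'.
Tree = Optional[Tuple[str, "Tree", "Tree"]]


class Failure(Exception):
    """Raised when no transition applies (hypothetical helper)."""


def bottom_up(tree: Tree,
              delta: Callable[[str, str, str], Optional[str]],
              leaf_state: str,
              top_states: Set[str]) -> Dict[str, str]:
    """Return a mapping from node paths to states, or {} if the run fails."""
    mapping: Dict[str, str] = {}

    def visit(node: Tree, path: str) -> str:
        if node is None:
            return leaf_state              # null leaves start in the bottom state
        label, left, right = node
        q1 = visit(left, path + "1")       # state of the first child
        q2 = visit(right, path + "2")      # state of the second child
        q = delta(q1, q2, label)
        if q is None:
            raise Failure(path)
        mapping[path] = q
        return q

    try:
        root_state = visit(tree, "")
    except Failure:
        return {}
    return mapping if root_state in top_states else {}


def contains_b(q1: str, q2: str, label: str) -> str:
    """Hypothetical transition function: q1 means 'a b-labelled node occurs
    in this binary subtree', q0 means it does not."""
    return "q1" if label == "b" or "q1" in (q1, q2) else "q0"


t = ("a", ("c", None, None), ("b", None, None))
print(bottom_up(t, contains_b, "q0", {"q1"}))
```

The recursion visits every node, so it gives no skipping by itself; it only shows the node-to-state mapping that a relevance-aware evaluator would have to reproduce on the nodes it does visit.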



Example B.1

$$A_{//a[.//b]} = (\{a, b, c\}, \{q_0, q_1\}, \{q_0\}, \{q_0, q_1\}, \{(q_1, a)\}, \delta)$$

A transition $(q, L, q', q'') \in \delta$ is written in the form $q \leftarrow L, (q', q'')$
and the wildcard _ denotes any state in $\{q_0, q_1\}$. $\delta$ is defined by:



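$$
\begin{array}{l}
q_1 \leftarrow \{b\}, (q_0, \_) \\
q_0 \leftarrow \Sigma \setminus \{b\}, (q_0, \_) \\
q_1 \leftarrow \Sigma \setminus \{a\}, (q_1, \_) \\
q_1 \leftarrow \dots
\end{array}
$$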



A run of this automaton on an input tree is given in Figure 6.
This automaton selects all the a-labelled nodes which are above a
b-labelled node. The selected nodes are circled and the relevant
nodes are underlined. As in the general case and the TDSTA case,
selected nodes are relevant. Moreover, any subtree whose root is
in state $q_0$ contains only non-relevant nodes.




Figure 6: Bottom-up run of automaton A from Example B.1

In the case of minimal BDSTAs, the state $q_0$ allows subtrees to be
skipped (as $q_\top$ does for TDSTAs). Indeed, in a minimal BDSTA, $q_0$ is
the only state which accepts a null-tree. Conversely, any subtree
which is recognized in $q_0$ could be replaced by a null-tree without
changing the semantics of the query. Thus, skipped subtrees are
those whose root is in state $q_0$. For skipping nodes along a path,
the same conditions as previously apply: either the automaton
remains in the same state for a node and both its children, or the node
and one of its children are in the same state and the other child
can be skipped, that is, it is in state $q_0$.

C. AUTOMATA FOR XPATH 

Definition C.1 (XPath fragment) An XPath expression is a finite
production of the following grammar, with start symbol Core:

Core         ::= LocationPath | '/' LocationPath
LocationPath ::= LocationStep ('/' LocationStep)*
LocationStep ::= Axis '::' NodeTest
               | Axis '::' NodeTest '[' Pred ']'
Pred         ::= Pred 'and' Pred | Pred 'or' Pred
               | 'not' '(' Pred ')' | Core | '(' Pred ')'
Axis         ::= descendant | child
               | following-sibling | attribute
NodeTest     ::= tag | * | node() | text()



The following example clearly shows why using normal STAs would 
cause an exponential blow-up: 

Example C.1 Consider the XPath query:

//x[ (a1 or a2) and ... and (a2n-1 or a2n) ]

where the $a_i$ are pairwise distinct labels. The ASTA for this query
is:

$$
\begin{array}{lcl}
q_x, \{x\} & \Rightarrow & (\downarrow_1 q_{a_1} \lor \downarrow_1 q_{a_2}) \land \dots \land (\downarrow_1 q_{a_{2n-1}} \lor \downarrow_1 q_{a_{2n}}) \\
q_x, \Sigma & \rightarrow & \downarrow_1 q_x \lor \downarrow_2 q_x \\
q_{a_i}, \{a_i\} & \rightarrow & \top \\
q_{a_i}, \Sigma & \rightarrow & \downarrow_2 q_{a_i}
\end{array}
$$

This ASTA has $2n + 1$ states and $4n + 2$ transitions, one of length
$2n$ and the others of fixed length (less than 3). It is well known
that converting this ASTA into an STA yields an exponential blow-up
(since one has to compute the disjunctive normal form of the
formulas; for the first transition, the DNF has size $2^n$).
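The size of the disjunctive normal form can be checked with a small Python sketch. It assumes a simplified encoding that is not part of the paper: the formula of the first transition is represented as a list of two-literal clauses, and the helper names `query_clauses` and `dnf_of_cnf` are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Set, Tuple


def query_clauses(n: int) -> List[Tuple[str, str]]:
    """Clauses (a1 or a2), ..., (a2n-1 or a2n) with pairwise distinct labels."""
    return [(f"a{2 * i - 1}", f"a{2 * i}") for i in range(1, n + 1)]


def dnf_of_cnf(clauses: List[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Distribute 'and' over 'or': one conjunctive term per way of picking
    one literal from each clause."""
    return {frozenset(choice) for choice in product(*clauses)}


for n in range(1, 6):
    print(n, len(dnf_of_cnf(query_clauses(n))))
```

Since the labels are pairwise distinct, no two choices collapse into the same term, so the number of terms doubles with every additional clause, while the alternating formula itself grows only linearly.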
Evaluation of formulas and node selection: We define the notion
of result sets and the semantics of the evaluation of formulas, which
also handles node selection.

Definition C.2 (Result set) Let $A = (\Sigma, Q, \mathcal{T}, \delta)$ be an ASTA
and $t \in T(\Sigma)$. A result set is a mapping from states in $Q$ to sets
of nodes in $\mathrm{Dom}(t)$. Given a mapping $\Gamma$, we denote by $\Gamma(q)$ the
set of nodes associated with $q$ (the empty set if $q$ is not in $\mathrm{Dom}(\Gamma)$)



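$$
\frac{}{\Gamma_1, \Gamma_2 \vdash_A \top = (\top, \emptyset)}\ \text{(true)}
\qquad
\frac{\Gamma_1, \Gamma_2 \vdash_A \phi = (b, \Gamma')}{\Gamma_1, \Gamma_2 \vdash_A \lnot\phi = (\overline{b}, \emptyset)}\ \text{(not)}
$$

$$
\frac{\Gamma_1, \Gamma_2 \vdash_A \phi_1 = (b_1, \Gamma'_1) \qquad \Gamma_1, \Gamma_2 \vdash_A \phi_2 = (b_2, \Gamma'_2)}{\Gamma_1, \Gamma_2 \vdash_A \phi_1 \lor \phi_2 = (b_1, \Gamma'_1) \oplus (b_2, \Gamma'_2)}\ \text{(or)}
$$

$$
\frac{\Gamma_1, \Gamma_2 \vdash_A \phi_1 = (b_1, \Gamma'_1) \qquad \Gamma_1, \Gamma_2 \vdash_A \phi_2 = (b_2, \Gamma'_2)}{\Gamma_1, \Gamma_2 \vdash_A \phi_1 \land \phi_2 = (b_1, \Gamma'_1) \otimes (b_2, \Gamma'_2)}\ \text{(and)}
$$

$$
\frac{q \in \mathrm{Dom}(\Gamma_i)}{\Gamma_1, \Gamma_2 \vdash_A \downarrow_i q = (\top, \Gamma_i(q))}\ \text{for } i \in \{1, 2\}\ \text{(left, right)}
\qquad
\frac{\text{when no other rule applies}}{\Gamma_1, \Gamma_2 \vdash_A \phi = (\bot, \emptyset)}
$$

where:

$$
\overline{\top} = \bot \qquad \overline{\bot} = \top
$$

$$
(b_1, \Gamma_1) \oplus (b_2, \Gamma_2) =
\begin{cases}
(\top, \Gamma_1) & \text{if } b_1 = \top,\ b_2 = \bot \\
(\top, \Gamma_2) & \text{if } b_2 = \top,\ b_1 = \bot \\
(\top, \Gamma_1 \cup \Gamma_2) & \text{if } b_1 = \top,\ b_2 = \top \\
(\bot, \emptyset) & \text{otherwise}
\end{cases}
$$

$$
(b_1, \Gamma_1) \otimes (b_2, \Gamma_2) =
\begin{cases}
(\top, \Gamma_1 \cup \Gamma_2) & \text{if } b_1 = \top,\ b_2 = \top \\
(\bot, \emptyset) & \text{otherwise}
\end{cases}
$$

Figure 7: Inference rules defining the evaluation of a formula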

and we define the union of two mappings as:

$$(\Gamma_1 \cup \Gamma_2)(q) = \Gamma_1(q) \cup \Gamma_2(q)$$

We can now define the evaluation of a set of transitions for an 
automaton. 

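Definition C.3 (Evaluation of a set of transitions) Let
$A = (\Sigma, Q, \mathcal{T}, \delta)$ be an ASTA, $t \in T(\Sigma)$ a tree and
$\mathit{Trs} \subseteq \delta$ a set of transitions. The evaluation of $\mathit{Trs}$
for a node $\pi \in \mathrm{Dom}(t)$ is a result set given by the function:

$$
\begin{array}{l}
\mathrm{eval\_trans}(\Gamma_1, \Gamma_2, \pi, \mathit{Trs}) = \\
\qquad \displaystyle\bigcup_{(q, L, \rightarrow, \phi) \in \mathit{Trs}} \{\, q \mapsto S \mid \Gamma_1, \Gamma_2 \vdash_A \phi = (\top, S) \,\} \\
\qquad \cup \displaystyle\bigcup_{(q, L, \Rightarrow, \phi) \in \mathit{Trs}} \{\, q \mapsto \{\pi\} \cup S \mid \Gamma_1, \Gamma_2 \vdash_A \phi = (\top, S) \,\}
\end{array}
$$

where $\Gamma_1$ and $\Gamma_2$ are result sets, and $\Gamma_1, \Gamma_2 \vdash_A \phi = (b, S)$ is the
judgement derived by the rules in Figure 7.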

These rules are straightforward and combine the rules for
a classical alternating automaton with the rules of a marking
automaton. Rules (or) and (and) implement the Boolean connectives
of the formula and collect the markings found in their true
sub-formulas. Rules (left) and (right) (written as a rule schema for
concision) evaluate to true if the state $q$ is in the corresponding set.
Intuitively, states in $\Gamma_1$ (resp. $\Gamma_2$) are those accepted in the left
(resp. right) subtree of the input tree. To handle selection, we
proceed as follows. Assume that the left subtree returned a result set $\Gamma_1$
and the right subtree a result set $\Gamma_2$:

(1) For each $q, L \Rightarrow \phi$ such that $\phi$ evaluates to $\top$ ($\downarrow_i q'$ evaluates
to $\top$ if $q' \in \mathrm{Dom}(\Gamma_i)$), add the mapping $q \mapsto \{\pi\}$ to $\Gamma$;

(2) For each $q, L \rightarrow \phi$ or $q, L \Rightarrow \phi$ for which $\phi$ evaluates to $\top$,
if $\downarrow_i q' \in \phi$ evaluates to $\top$, add the mapping $q \mapsto \Gamma_i(q')$ to $\Gamma$.

Figure 8: Query answering time for SXSI and MonetDB

This is done by the function eval_trans. Informally, we remember
each node which was selected by a particular transition (1), and
for each selected node in state $q'$ we propagate it to $q$ if it
contributes to the truth of a formula proving $q$. The selected nodes
which get propagated to a state in $\mathcal{T}$ are therefore part of an
accepting run and constitute the result of the query. If we take the
example run given in the figure of Section 4, node selection is
performed as follows. Consider the rightmost c node in the figure
(*). This node was entered in state $\{q_0, q_1, q_2\}$, therefore the
active transitions for it are:

$$
\{\, q_0, \Sigma \rightarrow \downarrow_1 q_0 \lor \downarrow_2 q_0;\quad q_1, \Sigma \rightarrow \downarrow_1 q_1 \lor \downarrow_2 q_1;\quad q_2, \{c\} \rightarrow \top;\quad q_2, \Sigma \rightarrow \downarrow_2 q_2 \,\}
$$

and the result sets for its left and right subtrees are $(\emptyset, \emptyset)$ (since the
calls to both the left and the right move failed). In this environment only
the third transition is satisfied; the result set returned for this node
is therefore $\Gamma_1 = \{q_2 \mapsto \emptyset\}$. Returning from the recursive calls,
we arrive at the b node above it, for which the active transitions
are:

$$
\{\, q_0, \Sigma \rightarrow \downarrow_1 q_0 \lor \downarrow_2 q_0;\quad q_1, \{b\} \Rightarrow \downarrow_1 q_2;\quad q_1, \Sigma \rightarrow \downarrow_1 q_1 \lor \downarrow_2 q_1 \,\}
$$

Evaluated under the results $(\Gamma_1, \emptyset)$ for the left and right subtrees,
only the second transition is satisfied. Furthermore, this transition
is a selecting one; it therefore returns the result set $\Gamma_2 = \{q_1 \mapsto \{\pi_b\}\}$,
where $\pi_b$ is the identifier of this node. The parent of this b node
is again a b node where the same transitions are active. However,
the result sets for the left and right subtrees are $(\emptyset, \Gamma_2)$. Under
these hypotheses only the third transition can be satisfied (and it is
not a selecting one). The current b node is therefore not selected,
but the result set is $\Gamma_3 = \{q_1 \mapsto \Gamma_2(q_1)\}$ (since $\downarrow_2 q_1$
evaluated to $\top$ during the evaluation of the third transition). We have
$\Gamma_3 = \{q_1 \mapsto \{\pi_b\}\}$. We now move on to the a parent of this b
node, where the active transitions are:

$$
\{\, q_0, \{a\} \rightarrow \downarrow_1 q_1;\quad q_0, \Sigma \rightarrow \downarrow_1 q_0 \lor \downarrow_2 q_0 \,\}
$$

evaluated under the assumptions $(\Gamma_3, \emptyset)$. Here the first formula
evaluates to $\top$, yielding the result set $\Gamma_4 = \{q_0 \mapsto \Gamma_3(q_1)\} =
\{q_0 \mapsto \{\pi_b\}\}$. We now see that the node $\pi_b$ has been
"promoted" to state $q_0$. Using this technique we can ensure that nodes
selected non-deterministically during the bottom-up run are kept
only if they propagate up to the starting state $q_0$, in which case they
are part of the result.
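The four evaluation steps of this run can be replayed with a short Python sketch. The encoding is an assumption made for illustration only: formulas are nested tuples, result sets are dictionaries from state names to sets of node identifiers, each transition is a triple (state, selecting flag, formula) that is already known to be active for the node, and negation is left out. The identifiers `pi_c`, `pi_b`, `pi_b_parent` and `pi_a` are hypothetical names for the four nodes.

```python
from typing import Dict, FrozenSet, List, Tuple

ResultSet = Dict[str, FrozenSet[str]]
TOP = ("top",)


def down(i: int, q: str) -> tuple:
    return ("down", i, q)


def disj(f: tuple, g: tuple) -> tuple:
    return ("or", f, g)


def eval_formula(phi: tuple, g1: ResultSet, g2: ResultSet) -> Tuple[bool, FrozenSet[str]]:
    """Truth value and collected nodes of a formula (negation omitted)."""
    kind = phi[0]
    if kind == "top":
        return True, frozenset()
    if kind == "down":
        gamma = g1 if phi[1] == 1 else g2
        if phi[2] in gamma:
            return True, frozenset(gamma[phi[2]])
        return False, frozenset()
    b1, s1 = eval_formula(phi[1], g1, g2)
    b2, s2 = eval_formula(phi[2], g1, g2)
    if kind == "and":
        return (True, s1 | s2) if b1 and b2 else (False, frozenset())
    if kind == "or":
        if b1 and b2:
            return True, s1 | s2
        if b1:
            return True, s1
        if b2:
            return True, s2
        return False, frozenset()
    raise ValueError(f"unknown formula kind: {kind}")


def eval_trans(g1: ResultSet, g2: ResultSet, node: str,
               transitions: List[Tuple[str, bool, tuple]]) -> ResultSet:
    """Result set of a node from the result sets of its two subtrees."""
    result: ResultSet = {}
    for q, selecting, phi in transitions:
        ok, nodes = eval_formula(phi, g1, g2)
        if ok:
            marked = (nodes | {node}) if selecting else nodes
            result[q] = result.get(q, frozenset()) | marked
    return result


def loop(q: str) -> Tuple[str, bool, tuple]:
    """Non-selecting transition q, Sigma -> down1 q or down2 q."""
    return (q, False, disj(down(1, q), down(2, q)))


at_c = [loop("q0"), loop("q1"), ("q2", False, TOP), ("q2", False, down(2, "q2"))]
at_b = [loop("q0"), ("q1", True, down(1, "q2")), loop("q1")]
at_a = [("q0", False, down(1, "q1")), loop("q0")]

gamma1 = eval_trans({}, {}, "pi_c", at_c)
gamma2 = eval_trans(gamma1, {}, "pi_b", at_b)
gamma3 = eval_trans({}, gamma2, "pi_b_parent", at_b)
gamma4 = eval_trans(gamma3, {}, "pi_a", at_a)
print(gamma1, gamma2, gamma3, gamma4)
```

In this sketch the selected node is recorded once, at the selecting transition, and is afterwards only copied from state to state, which mirrors the "promotion" described above.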

D. EXPERIMENTS 

Experimental setup. Tests were executed on an Intel Xeon Core
2 Duo, 3 GHz, with 4 GB of RAM. We used the Ubuntu Linux 9.10
distribution, with kernel 2.6.32 and a 64-bit userland. Our
implementation was compiled using g++ 4.4.1 and OCaml 3.11.1. We
used version v4.34.0 of the MonetDB Server, with 32-bit OIDs.
Experimental results for queries Q01 to Q15 are given in Figure 8.
For both engines, the results were materialized in memory but not
serialized. We took the best of 5 consecutive runs for each query.



